The key thing to remember is that in Spark RDD/DF are immutable. So once created you can not change them.

However there are many situation where you want the column type to be different.

E.g By default Spark comes with cars.csv where year column is a String. If you want to use a datetime function you need the column as a Datetime. You can change the column type from string to date in a new dataframe.

Here is an example to change the column type.

val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "file:///Users/vshukla/projects/spark/sql/core/src/test/resources/cars.csv", "header" -> "true"))
df2.printSchema()
val df4 = df2.withColumn("year2", 'year.cast("Int")).select('year2 as 'year, 'make, 'model, 'comment)
df4.printSchema
 

Hi,

Its not working for me ..Unable to cast the column data type.Here is my code.I am getting only null values.Thanks!

val df4 = businessDF.withColumn("Estimated Gross Loss1", $"Estimated Gross Loss".cast("Int")).select($"Estimated Gross Loss1" as "Estimated Gross Loss")

Nice article. I am looking for a solution, Instead of specifying a each column separately. is there any way we can dynamically handle the datatype. Lets say in my dataframe i have 50 columns out of 8 are decimals and need to convert all decimal datatype to double. Without specify a column name can we do that directly?
I am also looking for something like this. I need to convert all my date datatypes to varchar in a dataframe having more than 300 columns. 

Other Popular Courses