Data Engineering

How do I specify a column's data type with Spark DataFrames?

Raie
New Contributor III

What I am doing:

spark_df = spark.createDataFrame(dfnew)

spark_df.write.saveAsTable("default.test_table", index=False, header=True)

This automatically detects the data types and is working right now. But what if a data type cannot be detected, or is detected wrong? I'm mostly concerned about doubles, ints, and bigints.

I tested casting, but it doesn't work on Databricks:

spark_df = spark.createDataFrame(dfnew.select(dfnew("Year").cast(IntegerType).as("Year")))

Is there a way to feed a DDL to spark dataframe for databricks? Should I not use spark to create the table?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

  • Just create the table earlier and set the column types (CREATE TABLE ... LOCATION path).
  • In the DataFrame you need the corresponding data types, which you can produce with cast syntax; your syntax is just incorrect. Here is an example of the correct syntax:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col

# Build a sample DataFrame with "Year" as strings, then cast the column to int
dfnew = spark.createDataFrame([("2022",), ("2021",), ("2020",)], ["Year"])
dfnew = dfnew.withColumn("Year", col("Year").cast(IntegerType()))


3 REPLIES

RKNutalapati
Valued Contributor

Hi @Raie A: It will create the data frame with wider data types, for example Long for Int/BigInt. It depends on the use case. If you have created a table and want to overwrite or append data to it, then you need to cast explicitly to match each column's data type.

One option is to create a case class and convert the DataFrame to a Dataset (Scala):

case class person(id: Int, name: String)

dfnew.as[person]


Raie
New Contributor III

Thanks @Hubert Dudek! I changed the syntax and imported all the data types, and casting is working!
