Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do I make all columns default to string when inferring a CSV schema with spark-csv?

New Contributor

I am using the spark-csv utility, but when it infers the schema I need all columns to be read as string columns by default.

Thanks in advance.


Databricks Employee

You can specify the schema manually, e.g.:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
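If typing out every StructField is impractical, one possible shortcut is to build an all-string schema from the header row. The sketch below is in Python rather than Scala. The helper names `read_header`, `all_string_ddl`, and `read_csv_as_strings` are hypothetical, and it assumes the file is readable from the driver's local filesystem, that its first line is a header, and that `spark` is a SparkSession. It relies on `DataFrameReader.schema` accepting a DDL-formatted string, which recent Spark versions support for the built-in CSV reader.

```python
import csv


def read_header(path, delimiter=","):
    """Return the column names from the first line of a local CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter=delimiter))


def all_string_ddl(columns):
    """Build a DDL schema string that declares every column as STRING."""
    # Backticks allow names with spaces or punctuation.
    # Names that themselves contain a backtick are not handled here.
    return ", ".join("`{}` STRING".format(name) for name in columns)


def read_csv_as_strings(spark, path):
    """Read a CSV with an explicit all-string schema.

    `spark` is assumed to be an existing SparkSession.
    """
    ddl = all_string_ddl(read_header(path))
    return spark.read.option("header", "true").schema(ddl).csv(path)
```

For the cars.csv example, `all_string_ddl` would produce one `STRING` entry per header name, so no column type has to be written by hand.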

New Contributor II

I was solving the same issue: I wanted all the columns as text, and to deal with the correct casts later. I solved it by recasting all the columns to string after inferring the schema. I'm not sure if it's efficient, but it works.

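# pyspark
# `spark` is assumed to be the SparkSession (predefined in Databricks notebooks)
path = '...'
df = spark.read \
    .option("inferSchema", "true") \
    .csv(path)

# cast every inferred column back to string
for column in df.columns:
    df = df.withColumn(column, df[column].cast('string'))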

Then you have to read the file again with the changed schema:

df = spark.read.schema(df.schema).csv(path)

This doesn't deal with nested columns, though as far as I know CSV doesn't produce any nested structs.

New Contributor II

@peyman what if I don't want to manually specify the schema?

For example, I have a vendor that can't build a valid .csv file. I just need to import it somewhere so I can explore the data and find the errors.

Just like in the original author's question: how do I tell Spark to read all columns as strings?
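For what it's worth, Spark's CSV reader only guesses column types when `inferSchema` is enabled. With that option left at its default of `false`, every column comes back as a string. A minimal sketch, assuming a SparkSession is passed in as `spark` (as in a Databricks notebook) and that the file has a header row; the function name `read_csv_raw` is hypothetical:

```python
def read_csv_raw(spark, path):
    """Read a CSV without type inference, so every column is a string.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read
        .option("header", "true")        # take column names from the first line
        .option("inferSchema", "false")  # the default, spelled out for clarity
        .csv(path)
    )
```

Calling `read_csv_raw(spark, path).printSchema()` should then list each column as `string`, and individual columns can be cast later once the data has been inspected.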
