Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to infer a CSV schema with all columns defaulting to string using spark-csv?

Jasam
New Contributor

I am using the spark-csv utility, but when it infers the schema I need all columns to be treated as string columns by default.

Thanks in advance.

3 REPLIES

User16789201666
Contributor II

You can manually specify the schema, e.g., following the example from https://github.com/databricks/spark-csv:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)
val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
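If the goal is simply to get every column as a string, you don't have to type the schema out by hand: one option is to read just the header to discover the column names and build an all-string schema from them. A minimal PySpark sketch, assuming a SparkSession named spark (as in Databricks notebooks) and the same placeholder file cars.csv:

from pyspark.sql.types import StructType, StructField, StringType

path = "cars.csv"  # placeholder path from the example above

# Read the header once just to discover the column names
header_cols = spark.read.option("header", "true").csv(path).columns

# Build an explicit schema where every field is StringType
all_string_schema = StructType(
    [StructField(name, StringType(), True) for name in header_cols]
)

df = (spark.read
      .option("header", "true")
      .schema(all_string_schema)
      .csv(path))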

vadeka
New Contributor II

I was solving the same issue: I wanted all the columns as text and to deal with the correct casts later. I solved it by recasting all the columns to string after inferring the schema. I'm not sure if it's efficient, but it works.

# pyspark
path = '...'
df = spark.read \
    .option("inferSchema", "true") \
    .csv(path)

for column in df.columns:
    df = df.withColumn(column, df[column].cast('string'))

Then you have to read the file again with the changed schema:

df = spark.read.schema(df.schema).csv(path)

This doesn't handle nested columns, however, although CSV shouldn't produce any nested structs anyway.
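For what it's worth, the per-column withColumn loop builds a long chain of projections; the same recast can be expressed as a single select. A small sketch, assuming the df inferred above:

from pyspark.sql.functions import col

# Cast every column to string in one projection
df = df.select([col(c).cast("string") for c in df.columns])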

jhoop2002
New Contributor II

@peyman what if I don't want to manually specify the schema?

For example, I have a vendor that can't build a valid .csv file. I just need to import it somewhere so I can explore the data and find the errors.

Just like the original author's question: how do I tell Spark to read all columns as strings?
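For what it's worth, both Spark's built-in CSV reader (2.x+) and the spark-csv package leave inferSchema off by default, so if you simply don't enable it, every column comes back as StringType. A minimal sketch, assuming a SparkSession named spark and a placeholder path:

# inferSchema defaults to false, so every column is read as StringType
df = spark.read.option("header", "true").csv("/path/to/vendor_file.csv")
df.printSchema()  # all fields: string (nullable = true)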
