How to infer a CSV schema with all columns defaulting to string using spark-csv?
07-19-2016 08:17 AM
I am using the spark-csv utility, but I need all columns to be turned into string columns by default when it infers the schema.
Thanks in advance.
- Labels: Change data capture, CSV, Schema
07-22-2016 09:30 AM
You can manually specify the schema, e.g. from the spark-csv README (https://github.com/databricks/spark-csv):
```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .schema(customSchema)
  .load("cars.csv")

val selectedData = df.select("year", "model")
selectedData.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
```
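If the goal is every column as string without spelling out each field by hand, one option is to build the schema from the header row. This is just a sketch (PySpark, assuming the same cars.csv as above and a plain comma-delimited header with no quoted commas):

```python
# Hypothetical sketch: derive an all-StringType schema from the header
# instead of writing one StructField per column by hand.
from pyspark.sql.types import StructType, StructField, StringType

# Naive header parse; this won't handle quoted commas in column names.
header = sc.textFile("cars.csv").first().split(",")
string_schema = StructType([StructField(h, StringType(), True) for h in header])

df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .schema(string_schema) \
    .load("cars.csv")
```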
11-15-2018 05:57 AM
I was solving the same issue: I wanted all the columns as text, to deal with correct casting later. I solved it by recasting all the columns to string after I'd inferred the schema. I'm not sure if it's efficient, but it works.
```python
# pyspark
path = '...'
df = spark.read \
    .option("inferSchema", "true") \
    .csv(path)

# Recast every inferred column to string.
for column in df.columns:
    df = df.withColumn(column, df[column].cast('string'))
```
Then you have to read again with the changed schema:

```python
df = spark.read.schema(df.schema).csv(path)
```
This doesn't deal with nested columns, however, though CSV shouldn't create any nested structs, I hope.
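As a side note on the loop above, the same recast can be done in a single projection, which avoids one withColumn call per column (a sketch assuming the inferred `df` from the snippet above):

```python
from pyspark.sql.functions import col

# Cast every column to string in one select instead of a withColumn loop.
df_str = df.select([col(c).cast('string') for c in df.columns])
```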
04-19-2021 02:09 PM
@peyman what if I don't want to manually specify the schema?
For example, I have a vendor that can't build a valid .csv file. I just need to import it somewhere so I can explore the data and find the errors.
Just like the original author's question: how do I tell Spark to read all columns as strings?
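For what it's worth, a minimal sketch with the modern DataFrame reader (the path here is a placeholder): schema inference is off by default, and with it off Spark reads every CSV column as StringType, so no manual schema is needed:

```python
# inferSchema defaults to false; with it off, every column is StringType.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "false")  # explicit here, but false is the default
      .csv("/path/to/vendor_file.csv"))

df.printSchema()  # every field prints as string
```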

