Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

CSV Reader reads quoted fields inconsistently in last column

Martinitus
New Contributor III

I just opened another issue: https://issues.apache.org/jira/browse/SPARK-46959

It corrupts data even when read with mode="FAILFAST". I consider this critical, because basic functionality like this should just work!

4 REPLIES

feiyun0112
Contributor III

You are using the escape option incorrectly.

df = (spark.read
  .format("csv")
  .option("header","true")
  .option("sep",";")
  .option("encoding","ISO-8859-1")
  .option("lineSep","\r\n")
  .option("nullValue","")
  .option("quote",'"')
  #.option("escape","") 
  .load("/FileStore/1.csv")
)

df.display()



Output:
a,b,c,d
10,"100,00",Some;String,ok
20,"200,00",null,still ok
30,"300,00",also ok,null
40,"400,00",null,null

CSV Files - Spark 3.5.0 Documentation (apache.org)

Martinitus
New Contributor III

Not providing the escape option would make it default to "\", which I do not want.

Also, if I provide an invalid option, I expect an error at that point, not corrupted data.
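As an aside, the effect of declaring an escape character can be seen with Python's stdlib csv module (an illustration only, not Spark's CSV parser): with an escape character set, backslashes in the data are consumed by the parser; without one, they pass through untouched.

```python
import csv
import io

# Illustration only (Python's stdlib csv module, not Spark): the same
# line parses differently depending on whether an escape character is set.
line = 'a;"100,00";c:\\temp'

no_escape = next(csv.reader(io.StringIO(line), delimiter=';', quotechar='"'))
with_escape = next(csv.reader(io.StringIO(line), delimiter=';', quotechar='"',
                              escapechar='\\'))

print(no_escape)    # backslash kept as data: ['a', '100,00', 'c:\\temp']
print(with_escape)  # backslash consumed as escape: ['a', '100,00', 'c:temp']
```

This is why a default escape of "\" is unwelcome when the data contains literal backslashes.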


@Martinitus wrote:

Not providing the escape option would make it default to "\", which I do not want.

Also, if I provide an invalid option, I expect an error at that point, not corrupted data.


If there is no escape option, how should this string be parsed:

"some text";some text";some text"


Martinitus
New Contributor III

either: [ 'some text', 'some text"', 'some text"' ]

alternatively: [ '"some text"', 'some text"', 'some text"' ]

Probably the sanest behavior would be a parser error (with mode="FAILFAST").

Just parsing garbage without warning the user is certainly not a viable option.

I am well aware of the problems with CSV formats in general; it turns out I spend a significant amount of my working time dealing with these issues. Spark is a tool that should make this easier for me, not more difficult 😞
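For comparison, the first of those interpretations is what Python's stdlib csv module produces for this input when a quote character is configured but no escape character (an illustration only, not Spark's behavior): stray quotes inside an unquoted field are kept literally rather than raising an error.

```python
import csv
import io

# Ambiguous input: only the first field is cleanly quoted; the quotes
# in the later fields are unbalanced.
line = '"some text";some text";some text"'

row = next(csv.reader(io.StringIO(line), delimiter=';', quotechar='"'))
print(row)  # ['some text', 'some text"', 'some text"']
```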
