Re: CSV Reader reads quoted fields inconsistently ...

Martinitus · ‎02-02-2024

I just opened another issue: https://issues.apache.org/jira/browse/SPARK-46959

It corrupts data even when read with mode="FAILFAST", i consider it critical, because basic stuff like this should just work!

feiyun0112 · ‎02-02-2024

You are using the escape option incorrectly

df = (spark.read
  .format("csv")
  .option("header","true")
  .option("sep",";")
  .option("encoding","ISO-8859-1")
  .option("lineSep","\r\n")
  .option("nullValue","")
  .option("quote",'"')
  #.option("escape","") 
  .load("/FileStore/1.csv")
)

df.display()



------------------
a,b,c,d
10,"100,00",Some;String,ok
20,"200,00",null,still ok
30,"300,00",also ok,null
40,"400,00",null,null

CSV Files - Spark 3.5.0 Documentation (apache.org)

Martinitus · ‎02-05-2024

Not providing the escape option would default to "\" which I do not want.

Also, if I provide an invalid option, then I expect an error when doing so, not corrupted data.

feiyun0112 · ‎02-05-2024

@Martinitus wrote:
Not providing the escape option would default to "\" which I do not want.
Also, if I provide an invalid option, then I expect an error when doing so, not corrupted data.

if no escape option, how to convert this string:

"some text";some text";some text"

Martinitus · ‎02-05-2024

either: [ 'some text', 'some text"', 'some text"' ]

alternatively: [ '"some text"', 'some text"', 'some text"' ]

probably most sane behavior would be a parser error ( with mode="FAILFAST").

just parsing garbage without warning the user is certainly not a viable option.

I am well aware of the problems with CSV formats in general, it turns out I spend a significant amount of my working time dealing with those issues. Spark is a tool that should make this easier for me, not more difficult 😞

CSV Reader reads quoted fields inconsistently in last column