
CSV Reader reads quoted fields inconsistently in last column

Martinitus
New Contributor III

I just opened another issue: https://issues.apache.org/jira/browse/SPARK-46959

It corrupts data even when read with mode="FAILFAST". I consider it critical, because basic stuff like this should just work!


feiyun0112
Honored Contributor

You are using the escape option incorrectly.

 

 

df = (spark.read
  .format("csv")
  .option("header","true")
  .option("sep",";")
  .option("encoding","ISO-8859-1")
  .option("lineSep","\r\n")
  .option("nullValue","")
  .option("quote",'"')
  # .option("escape", "")  # left out here, so the default escape character "\" applies
  .load("/FileStore/1.csv")
)

df.display()



------------------
a,b,c,d
10,"100,00",Some;String,ok
20,"200,00",null,still ok
30,"300,00",also ok,null
40,"400,00",null,null

 

 

CSV Files - Spark 3.5.0 Documentation (apache.org)

Martinitus
New Contributor III

Not providing the escape option would default to "\" which I do not want.

Also, if I provide an invalid option, then I expect an error when doing so, not corrupted data.


@Martinitus wrote:

Not providing the escape option would default to "\" which I do not want.

Also, if I provide an invalid option, then I expect an error when doing so, not corrupted data.


If there is no escape option, how should this string be converted:

"some text";some text";some text"

 

Martinitus
New Contributor III

Either: [ 'some text', 'some text"', 'some text"' ]

Alternatively: [ '"some text"', 'some text"', 'some text"' ]

Probably the most sane behavior would be a parser error (with mode="FAILFAST").

Just parsing garbage without warning the user is certainly not a viable option.
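
To make the two behaviors concrete, here is a minimal sketch in plain Python. It uses the standard library csv module, not Spark's parser, so it only illustrates the options and says nothing about what Spark does internally. The helper strict_parse is hypothetical: it assumes RFC 4180 style quoting (a quote inside a quoted field is written as two quotes) and raises an error on a stray quote instead of guessing.

```python
import csv

LINE = '"some text";some text";some text"'


def lenient_parse(line, sep=";", quote='"'):
    """Parse one line with Python's csv module in its default, non-strict mode."""
    return next(csv.reader([line], delimiter=sep, quotechar=quote))


def strict_parse(line, sep=";", quote='"'):
    """Hypothetical strict parser: raise ValueError on any misplaced quote."""
    fields, buf = [], []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quote:
            # Quoted field: read up to the closing quote, "" means a literal quote.
            i += 1
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if line[i] == quote:
                    if i + 1 < n and line[i + 1] == quote:
                        buf.append(quote)
                        i += 2
                        continue
                    i += 1
                    break
                buf.append(line[i])
                i += 1
            if i < n and line[i] != sep:
                raise ValueError(f"unexpected character after closing quote at position {i}")
        else:
            # Unquoted field: a quote character is not allowed anywhere in it.
            while i < n and line[i] != sep:
                if line[i] == quote:
                    raise ValueError(f"stray quote in unquoted field at position {i}")
                buf.append(line[i])
                i += 1
        fields.append("".join(buf))
        buf = []
        if i >= n:
            return fields
        i += 1  # skip the separator


print(lenient_parse(LINE))  # ['some text', 'some text"', 'some text"']
try:
    strict_parse(LINE)
except ValueError as err:
    print("rejected:", err)
```

The lenient call yields the first interpretation listed above, while the strict variant refuses the line outright, which is the kind of outcome I would expect from mode="FAILFAST".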

I am well aware of the problems with CSV formats in general; it turns out I spend a significant amount of my working time dealing with those issues. Spark is a tool that should make this easier for me, not more difficult 😞
