CSV Reader reads quoted fields inconsistently in last column
02-02-2024 04:19 AM
I just opened another issue: https://issues.apache.org/jira/browse/SPARK-46959
It corrupts data even when read with mode="FAILFAST". I consider this critical, because basic functionality like this should just work!
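For reference, a minimal sketch of the kind of read being described, based on the options shown later in this thread (the path and file contents are placeholders, not the exact JIRA reproduction):

# Sketch only: a FAILFAST read that, per the discussion here, can still return
# mangled rows when an empty escape character is set, instead of raising an error.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("sep", ";")
      .option("quote", '"')
      .option("escape", "")          # empty escape string, as used in the original read
      .option("mode", "FAILFAST")    # expected to fail loudly on malformed records
      .load("/FileStore/1.csv"))     # placeholder path
df.display()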
02-02-2024 08:51 PM
You are using the escape option incorrectly:

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("sep", ";")
      .option("encoding", "ISO-8859-1")
      .option("lineSep", "\r\n")
      .option("nullValue", "")
      .option("quote", '"')
      # .option("escape", "")   # leave the escape option out instead of setting it to ""
      .load("/FileStore/1.csv")
)
df.display()
------------------
a,b,c,d
10,"100,00",Some;String,ok
20,"200,00",null,still ok
30,"300,00",also ok,null
40,"400,00",null,null
02-05-2024 12:50 AM
Not providing the escape option would default to "\", which I do not want.
Also, if I provide an invalid option, then I expect an error at that point, not corrupted data.
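If the goal is just to avoid the backslash default, one common workaround (a sketch, assuming the file escapes embedded quotes by doubling them, RFC 4180 style) is to set the escape character to the quote character itself rather than to an empty string:

df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("sep", ";")
      .option("quote", '"')
      .option("escape", '"')        # treat "" inside a quoted field as a literal quote; overrides the "\" default
      .load("/FileStore/1.csv"))    # same placeholder path as above
df.display()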
02-05-2024 01:17 AM
@Martinitus wrote: Not providing the escape option would default to "\", which I do not want. Also, if I provide an invalid option, then I expect an error at that point, not corrupted data.
If there is no escape option, how should this string be parsed:
"some text";some text";some text"
02-05-2024 03:56 AM - edited 02-05-2024 03:59 AM
Either: [ 'some text', 'some text"', 'some text"' ]
Alternatively: [ '"some text"', 'some text"', 'some text"' ]
Probably the most sane behavior would be a parser error (with mode="FAILFAST").
Just parsing garbage without warning the user is certainly not a viable option.
I am well aware of the problems with CSV formats in general; it turns out I spend a significant amount of my working time dealing with such issues. Spark is a tool that should make this easier for me, not more difficult 😞
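For comparison, Python's stdlib csv module, which is lenient by default, parses that string as the first variant; a quick sketch (delimiter ';', quote character '"', no escape character configured):

import csv

line = '"some text";some text";some text"'
row = next(csv.reader([line], delimiter=";", quotechar='"'))
print(row)   # ['some text', 'some text"', 'some text"']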