11-03-2017 12:01 AM
Hi!
I am facing an issue when reading and parsing a CSV file. Some records contain a newline symbol "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape"/"delimiter"/...).load("file.csv").
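Roughly, the second attempt looked like this (the option values shown are just illustrative; I tried several combinations):
val df = sqlContext.read
  .format("com.databricks.spark.csv")  // the spark-csv package
  .option("delimiter", ";")
  .option("escape", "\\")              // \ is supposed to be the default escape character
  .load("file.csv")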
However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol ends up stored in the record is not that important; the main issue is getting the correct set of records/lines.)
Any ideas on how to do that? Ideally without modifying the original file, and preferably without any post-/re-processing (for example, reading the file, filtering for lines with fewer fields than expected and then concatenating them could be a solution, but it is not at all optimal).
My hope was to use the Databricks CSV parser and set the escape character to \ (which is supposed to be the default), but that didn't work. Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?
Thanks!
Accepted Solutions
11-07-2017 11:43 PM
Spark 2.2.0 adds support for parsing multi-line CSV files, which is what I understand you to be describing. However, without quotes, the parser won't know how to distinguish a newline in the middle of a field from a newline at the end of a record.
And just to make sure that assertion is true, I ran the following tests. The first, with quoted fields, reads the CSV file in properly:
val jsonLines = """"Line1field1";"Line1field2.1 \
Line1field2.2";"Line1field3";
"Line2FIeld1";"Line2field2";"Line2field3";"""
val fileName = "/tmp/whatever.csv"
dbutils.fs.put(fileName, jsonLines, true)

val df = spark.read
.option("sep", ";")
.option("quote", "\"")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(fileName)
But the following test does not work:
val jsonLines = """Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;"""
val fileName = "/tmp/jdp/q12593.json"
dbutils.fs.put(fileName, jsonLines, true)
val df = spark.read
.option("sep", ";")
.option("quote", "")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(fileName)
11-07-2017 11:59 PM
The solution is sparkContext.wholeTextFiles.
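A minimal sketch of that approach (the join/split logic below is an assumption based on the sample file in the question, not part of the original reply):
// Read the whole file as a single (path, content) pair
val raw = spark.sparkContext.wholeTextFiles("/tmp/whatever.csv")

// Join the escaped line breaks ("\" followed by a newline) back into one
// logical record, then split the content into records and fields
val records = raw.values
  .flatMap(_.replaceAll("\\\\\r?\n", " ").split("\r?\n"))
  .filter(_.trim.nonEmpty)
  .map(_.stripSuffix(";").split(";"))

records.collect().foreach(r => println(r.mkString("[", ",", "]")))
From there, the RDD of field arrays can be converted to a DataFrame if needed.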
11-08-2017 04:06 PM
Nothing wrong with reverting to the RDD API, but the one caution here would be to be wary of the size of the files. Because each file is read entirely as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:
- Small files are preferred, large file is also allowable, but may cause bad performance.

