11-03-2017 12:01 AM
Hi!
I am facing an issue when reading and parsing a CSV file. Some records contain a newline symbol "escaped" by a \, and those records are not quoted. The file might look like this:
Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;
I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape"/"delimiter"/...).load("file.csv").
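Roughly, the second attempt looked like this (the option values shown are just illustrative; I tried several combinations):
val df = sqlContext.read
  .format("com.databricks.spark.csv")  // the spark-csv package
  .option("delimiter", ";")
  .option("escape", "\\")              // \ is supposed to be the default escape character
  .load("file.csv")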
However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:
[Line1field1,Line1field2.1,null] (3 fields)
[Line1field2.2,Line1field3,null] (3 fields)
[Line2FIeld1,Line2field2,Line2field3;] (3 fields)
The expected result is:
[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)
[Line2FIeld1,Line2field2,Line2field3] (3 fields)
(How the newline symbol ends up stored in the record is not that important; the main issue is getting the correct set of records/lines.)
Any ideas on how to do that? Ideally without modifying the original file, and preferably without any post-/re-processing (for example, reading the file, filtering for lines with fewer fields than expected and then concatenating them could be a solution, but it is not at all optimal).
My hope was to use the Databricks CSV parser and set the escape character to \ (which is supposed to be the default), but that didn't work. Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?
Thanks!
Accepted Solutions
11-07-2017 11:43 PM
Spark 2.2.0 adds support for parsing multi-line CSV files, which is what I understand you to be describing. However, without quotes, the parser won't know how to distinguish a newline in the middle of a field from a newline at the end of a record.
And just to make sure that assertion is true, I ran the following tests. The first, with quoted fields, reads the CSV file in properly:
val jsonLines = """"Line1field1";"Line1field2.1 \
Line1field2.2";"Line1field3";
"Line2FIeld1";"Line2field2";"Line2field3";"""
val fileName = "/tmp/whatever.csv"
dbutils.fs.put(fileName, jsonLines, true)

val df = spark.read
.option("sep", ";")
.option("quote", "\"")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(fileName)
But the following test does not work:
val jsonLines = """Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;"""
val fileName = "/tmp/jdp/q12593.json"
dbutils.fs.put(fileName, jsonLines, true)
val df = spark.read
.option("sep", ";")
.option("quote", "")
.option("multiLine", "true")
.option("inferSchema", "true")
.csv(fileName)
11-07-2017 11:59 PM
The solution is sparkContext.wholeTextFiles.
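A minimal sketch of that approach (the join/split logic below is an assumption based on the sample file in the question, not part of the original reply):
// Read the whole file as a single (path, content) pair
val raw = spark.sparkContext.wholeTextFiles("/tmp/whatever.csv")

// Join the escaped line breaks ("\" followed by a newline) back into one
// logical record, then split the content into records and fields
val records = raw.values
  .flatMap(_.replaceAll("\\\\\r?\n", " ").split("\r?\n"))
  .filter(_.trim.nonEmpty)
  .map(_.stripSuffix(";").split(";"))

records.collect().foreach(r => println(r.mkString("[", ",", "]")))
From there, the RDD of field arrays can be converted to a DataFrame if needed.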
11-08-2017 04:06 PM
Nothing wrong with reverting to the RDD API, but the one caution here would be to be wary of the size of the files. Because each file is read entirely as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:
- Small files are preferred, large file is also allowable, but may cause bad performance.

