<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to parse a file with newline character, escaped with \ and not quoted in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</link>
    <description>&lt;P&gt;Nothing wrong with reverting to the RDD API, but the one caution here is to be wary of the size of the files. Because each file is read in its entirety as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Small files are preferred, large file is also allowable, but may cause bad performance.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Thu, 09 Nov 2017 00:06:28 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-09T00:06:28Z</dc:date>
    <item>
      <title>How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29087#M20844</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi!&lt;/P&gt;
&lt;P&gt;I am facing an issue when reading and parsing a CSV file. Some records contain a newline symbol, "escaped" by a \, and the record is not quoted. The file might look like this:&lt;/P&gt;
&lt;P&gt;Line1field1;Line1field2.1 \&lt;/P&gt;
&lt;P&gt;Line1field2.2;Line1field3;&lt;/P&gt;
&lt;P&gt;Line2FIeld1;Line2field2;Line2field3;&lt;/P&gt;
&lt;P&gt;I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")&lt;/P&gt;
&lt;P&gt;However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So instead of the 2 records in the file above, I am getting three:&lt;/P&gt;
&lt;P&gt;[Line1field1,Line1field2.1,null] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line1field2.2,Line1field3,null] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line2FIeld1,Line2field2,Line2field3;] (3 fields)&lt;/P&gt;
&lt;P&gt;The expected result is:&lt;/P&gt;
&lt;P&gt;[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line2FIeld1,Line2field2,Line2field3] (3 fields)&lt;/P&gt;
&lt;P&gt;(How the newline symbol is saved in the record is not that important, main issue is having the correct set of records/lines)&lt;/P&gt;
&lt;P&gt;Any ideas how to do that, without modifying the original file and preferably without any post-/re-processing? (For example, reading the file, filtering out any lines with fewer fields than expected, and then concatenating them could be a solution, but it is far from optimal.)&lt;/P&gt;
&lt;P&gt;My hope was to use Databricks' CSV parser to set the escape character to \ (which is supposed to be the default), but that didn't work. Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 07:01:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29087#M20844</guid>
      <dc:creator>XinZodl</dc:creator>
      <dc:date>2017-11-03T07:01:16Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29088#M20845</link>
      <description>&lt;P&gt;Spark 2.2.0 adds support for parsing multi-line CSV files, which is what I understand you to be describing. However, without quotes, the parser won't know how to distinguish a newline in the middle of a field from a newline at the end of a record.&lt;/P&gt;&lt;P&gt;And just to make sure that assertion is true, I ran the following test, which reads the CSV file in properly:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val jsonLines = """"Line1field1";"Line1field2.1 \
Line1field2.2";"Line1field3";
"Line2FIeld1";"Line2field2";"Line2field3";"""
 
val fileName = "/tmp/whatever.csv"
dbutils.fs.put(fileName, jsonLines, true)
 
val df = spark.read
  .option("sep", ";")
  .option("quote", "\"")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But the following test does not work:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val jsonLines = """Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;"""
 
val fileName = "/tmp/jdp/q12593.json"
dbutils.fs.put(fileName, jsonLines, true)
 
val df = spark.read
  .option("sep", ";")
  .option("quote", "")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 07:43:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29088#M20845</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-08T07:43:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29089#M20846</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The solution is "sparkContext.wholeTextFiles".&lt;/P&gt;
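&lt;P&gt;For future readers, a minimal sketch of that approach (assuming Spark 2.x and the semicolon-delimited sample above; the file path is hypothetical): wholeTextFiles delivers each file as a single (path, content) pair, so the backslash-escaped line breaks can be repaired before splitting into records and fields:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val raw = sc.wholeTextFiles("/tmp/whatever.csv")
val rows = raw.flatMap { case (_, content) =>
  content.replace("\\\n", " ") // join the escaped line breaks back into one line
         .split("\n")          // one element per logical record
         .filter(_.nonEmpty)
         .map(_.split(";"))    // one array of fields per record
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note the caution in the next reply: this reads each file entirely into one executor's memory, so it only suits reasonably small files.&lt;/P&gt;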
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 07:59:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29089#M20846</guid>
      <dc:creator>XinZodl</dc:creator>
      <dc:date>2017-11-08T07:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</link>
      <description>&lt;P&gt;Nothing wrong with reverting to the RDD API, but the one caution here is to be wary of the size of the files. Because each file is read in its entirety as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Small files are preferred, large file is also allowable, but may cause bad performance.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 09 Nov 2017 00:06:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-09T00:06:28Z</dc:date>
    </item>
  </channel>
</rss>

