How to parse a file with newline character, escaped with \ and not quoted

XinZodl
New Contributor III

Hi!

I am facing an issue when reading and parsing a CSV file. Some records contain a newline escaped by a backslash (\), and those records are not quoted. The file might look like this:

Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;

I've tried reading it with sc.textFile("file.csv") and with sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv").

However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So, instead of getting 2 records from the file above, I am getting three:

[Line1field1,Line1field2.1,null] (3 fields)

[Line1field2.2,Line1field3,null] (3 fields)

[Line2FIeld1,Line2field2,Line2field3;] (3 fields)

The expected result is:

[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)

[Line2FIeld1,Line2field2,Line2field3] (3 fields)

(How the newline is stored in the record is not that important; the main issue is getting the correct set of records/lines.)

Any ideas on how to do that, without modifying the original file and preferably without any post- or re-processing? (For example, reading the file, filtering out lines with fewer fields than expected, and then concatenating them could be a solution, but it is far from optimal.)

My hope was to use Databricks' CSV parser and set the escape character to \ (which is supposed to be the default), but that didn't work. Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?

Thanks!


3 REPLIES

User16857281974
Contributor

Spark 2.2.0 adds support for parsing multi-line CSV files, which is what I understand you to be describing. However, without quotes, the parser has no way to distinguish a newline in the middle of a field from a newline at the end of a record.

And just to make sure that assertion is true, I ran the following test, which reads the CSV file in properly:

val jsonLines = """"Line1field1";"Line1field2.1 \
Line1field2.2";"Line1field3";
"Line2FIeld1";"Line2field2";"Line2field3";"""
<br>val fileName = "/tmp/whatever.csv"
dbutils.fs.put(fileName, jsonLines, true)<br><br>val df = spark.read
  .option("sep", ";")
  .option("quote", "\"")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)

But the following test does not work:

val jsonLines = """Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;"""
 
val fileName = "/tmp/jdp/q12593.json"
dbutils.fs.put(fileName, jsonLines, true)
 
val df = spark.read
  .option("sep", ";")
  .option("quote", "")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)

XinZodl
New Contributor III

Solution is "sparkContext.wholeTextFiles"

Nothing wrong with reverting to the RDD API, but the one caution here would be to be wary of the size of the files. Because each file is read entirely as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:

  • Small files are preferred, large file is also allowable, but may cause bad performance.
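
For example, a quick size check before committing to wholeTextFiles might look like this (the 128 MB threshold is an arbitrary illustration, not a documented limit):

// Hedged sketch: warn about files that wholeTextFiles would load as one
// huge record. The 128 MB threshold is an arbitrary illustration.
val bigFiles = dbutils.fs.ls("/tmp/")
  .filter(f => f.path.endsWith(".csv") && f.size > 128L * 1024 * 1024)

if (bigFiles.nonEmpty)
  println(s"${bigFiles.size} file(s) over 128 MB; wholeTextFiles reads each as a single record")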