<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to parse a file with newline character, escaped with \ and not quoted in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</link>
    <description>&lt;P&gt;Nothing wrong with reverting to the RDD API, but the one caution here is to be wary of the size of the files. Because each file is read in its entirety as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Small files are preferred, large file is also allowable, but may cause bad performance.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Thu, 09 Nov 2017 00:06:28 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-09T00:06:28Z</dc:date>
    <item>
      <title>How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29087#M20844</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi!&lt;/P&gt;
&lt;P&gt;I am facing an issue when reading and parsing a CSV file. Some records contain a newline symbol, "escaped" by a \, and the record is not quoted. The file might look like this:&lt;/P&gt;
&lt;P&gt;Line1field1;Line1field2.1 \&lt;/P&gt;
&lt;P&gt;Line1field2.2;Line1field3;&lt;/P&gt;
&lt;P&gt;Line2FIeld1;Line2field2;Line2field3;&lt;/P&gt;
&lt;P&gt;I've tried to read it using sc.textFile("file.csv") and using sqlContext.read.format("..databricks..").option("escape/delimiter/...").load("file.csv")&lt;/P&gt;
&lt;P&gt;However, no matter how I read it, a new record/line/row is created whenever "\ \n" is reached. So instead of the 2 records in the file above, I am getting three:&lt;/P&gt;
&lt;P&gt;[Line1field1,Line1field2.1,null] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line1field2.2,Line1field3,null] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line2FIeld1,Line2field2,Line2field3;] (3 fields)&lt;/P&gt;
&lt;P&gt;The expected result is:&lt;/P&gt;
&lt;P&gt;[Line1field1,Line1field2.1 Line1field2.2,Line1field3] (3 fields)&lt;/P&gt;
&lt;P&gt;[Line2FIeld1,Line2field2,Line2field3] (3 fields)&lt;/P&gt;
&lt;P&gt;(How the newline symbol is saved in the record is not that important, main issue is having the correct set of records/lines)&lt;/P&gt;
&lt;P&gt;Any ideas how to do that, without modifying the original file and preferably without any post-/re-processing? (For example, reading the file, filtering out any lines with fewer fields than expected, and then concatenating them could be a solution, but it is far from optimal.)&lt;/P&gt;
&lt;P&gt;My hope was to use Databricks' CSV parser to set the escape character to \ (which is supposed to be the default), but that didn't work. Should I somehow extend the parser and edit something, creating my own parser? What would be the best solution?&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Nov 2017 07:01:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29087#M20844</guid>
      <dc:creator>XinZodl</dc:creator>
      <dc:date>2017-11-03T07:01:16Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29088#M20845</link>
      <description>&lt;P&gt;Spark 2.2.0 adds support for parsing multi-line CSV files, which is what I understand you to be describing. However, without quotes, the parser won't know how to distinguish a newline in the middle of a field from a newline at the end of a record.&lt;/P&gt;&lt;P&gt;And just to make sure that assertion is true, I ran the following test, which reads the CSV file in properly:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val jsonLines = """"Line1field1";"Line1field2.1 \
Line1field2.2";"Line1field3";
"Line2FIeld1";"Line2field2";"Line2field3";"""
 
val fileName = "/tmp/whatever.csv"
dbutils.fs.put(fileName, jsonLines, true)
 
val df = spark.read
  .option("sep", ";")
  .option("quote", "\"")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But the following test does not work:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val jsonLines = """Line1field1;Line1field2.1 \
Line1field2.2;Line1field3;
Line2FIeld1;Line2field2;Line2field3;"""
 
val fileName = "/tmp/jdp/q12593.json"
dbutils.fs.put(fileName, jsonLines, true)
 
val df = spark.read
  .option("sep", ";")
  .option("quote", "")
  .option("multiLine", "true")
  .option("inferSchema", "true")
  .csv(fileName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 07:43:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29088#M20845</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-08T07:43:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29089#M20846</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The solution is "sparkContext.wholeTextFiles".&lt;/P&gt;
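&lt;P&gt;For future readers, a minimal sketch of that approach (assuming Spark 2.x and the semicolon-delimited sample above; the file path is hypothetical): wholeTextFiles delivers each file as a single (path, content) pair, so the backslash-escaped line breaks can be repaired before splitting into records and fields:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val raw = sc.wholeTextFiles("/tmp/whatever.csv")
val rows = raw.flatMap { case (_, content) =>
  content.replace("\\\n", " ") // join the escaped line breaks back into one line
         .split("\n")          // one element per logical record
         .filter(_.nonEmpty)
         .map(_.split(";"))    // one array of fields per record
}&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note the caution in the next reply: this reads each file entirely into one executor's memory, so it only suits reasonably small files.&lt;/P&gt;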
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Nov 2017 07:59:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29089#M20846</guid>
      <dc:creator>XinZodl</dc:creator>
      <dc:date>2017-11-08T07:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to parse a file with newline character, escaped with \ and not quoted</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</link>
      <description>&lt;P&gt;Nothing wrong with reverting to the RDD API, but the one caution here is to be wary of the size of the files. Because each file is read in its entirety as a single record, large files can cause significant performance issues, if they don't crash the executors outright. To quote the API docs:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Small files are preferred, large file is also allowable, but may cause bad performance.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 09 Nov 2017 00:06:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-parse-a-file-with-newline-character-escaped-with-and-not/m-p/29090#M20847</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-09T00:06:28Z</dc:date>
    </item>
  </channel>
</rss>

