<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Custom line separator in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll try to run this question by other instructors and/or engineers to see if there is some unknown/undocumented solution.&lt;/P&gt;</description>
    <pubDate>Wed, 29 Nov 2017 15:47:16 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-29T15:47:16Z</dc:date>
    <item>
      <title>Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29026#M20783</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I see that &lt;A href="https://github.com/apache/spark/pull/18581" target="_blank"&gt;https://github.com/apache/spark/pull/18581&lt;/A&gt; will enable defining custom line separators for many sources, including CSV. Apart from waiting for this PR to make it into the main Databricks runtime, is there any other way to support a different line separator, such as directly setting the HadoopFileLinesReader configuration?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 04:54:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29026#M20783</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T04:54:40Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29027#M20784</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Currently, the only known option is to fix the line separator before beginning your standard processing. &lt;/P&gt;&lt;P&gt;In that vein, one option I can think of is to use &lt;B&gt;SparkContext.wholeTextFiles(..)&lt;/B&gt; to read in an RDD, split the data by the customs line separator and then from there are a couple of additional choices:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Write the file back out with the new line separators.&lt;/LI&gt;&lt;LI&gt;Convert the RDD to a DataFrame with a call like &lt;B&gt;rdd.toDF()&lt;/B&gt; and resume processing.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The major disadvantage to this is the size of each file. The call to &lt;B&gt;wholeTextFiles(..) &lt;/B&gt;will load each file as a single partition which means we could very easily consume all the available RAM.&lt;/P&gt;&lt;P&gt;The second option I can think of would be to perform the same operation above, but from the driver with standard file IO (if possible). The Scala/Python libraries for file manipulation are pretty straight forward allowing you to read one chunk of data, clean it up and write it back out to a new file. Again this is a very ugly solution and is not as practical if you are working from most data stores like S3, HDFS, etc.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 15:46:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29027#M20784</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T15:46:24Z</dc:date>
    </item>
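The wholeTextFiles(..) workaround above boils down to one extra step: split each file's full text on the custom separator before normal CSV parsing takes over. A minimal plain-Python sketch of that splitting step follows; the "#" separator and the sample data are illustrative assumptions, and in a real job the same logic would run inside the RDD returned by wholeTextFiles(..):

```python
# Plain-Python sketch of the splitting step behind the wholeTextFiles(..)
# workaround: each file arrives as one big string, which we split on a
# custom record separator ("#" here, purely illustrative) before handing
# the records to a standard CSV parser.
import csv
import io

def split_records(whole_text, sep="#"):
    """Split one file's full text on a custom separator, dropping empties."""
    return [rec for rec in whole_text.split(sep) if rec.strip()]

# Example: three CSV records delimited by '#' instead of newlines.
raw = "1,alice#2,bob#3,carol"
records = split_records(raw)

# Re-join with '\n' so a standard CSV parser can take over -- this mirrors
# the "write the file back out with new line separators" option.
rows = list(csv.reader(io.StringIO("\n".join(records))))
print(rows)  # [['1', 'alice'], ['2', 'bob'], ['3', 'carol']]
```

In the real Spark job the same split would be applied per file, e.g. via flatMap over the (path, contents) pairs, followed by toDF() as described above.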
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll try to run this question by other instructors and/or engineers to see if there is some unknown/undocumented solution.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 15:47:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T15:47:16Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29029#M20786</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;From the referenced PR, I assume that we’re talking about processing files that use something other than &lt;B&gt;\n&lt;/B&gt; to delimit lines—e.g., &lt;B&gt;\r&lt;/B&gt;, or &lt;B&gt;\r\n.&lt;/B&gt; Since CSV files are assumed to be text files, and since Java uses a platform-specific notion of a line separator (&lt;B&gt;System.getProperty("line.separator")&lt;/B&gt;), it &lt;I&gt;might&lt;/I&gt; be possible to change that system property. However, I’m not 100% positive that’ll work, without either (a) digging into the source for the CSV reader, or (b) experimenting with changing that property (which needs to be propagated to the Executors).&lt;/P&gt;
&lt;P&gt;I'd definitely stay away from &lt;B&gt;wholeTextFiles()&lt;/B&gt;, for the reasons Jacob Parr (@SireInsectus) mentions.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 16:38:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29029#M20786</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-29T16:38:53Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29030#M20787</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Brian's answer above is probably the simplest. Here are some other options in case &lt;B&gt;line.separator&lt;/B&gt; doesn't do the trick:&lt;/P&gt;
&lt;P&gt;Option: Custom CSV Reader&lt;/P&gt;
&lt;P&gt;Modify the CSV reader and upload it as a library. Here are the steps:&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;Fork the current CSV reader from &lt;A href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala" target="_blank"&gt;https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Merge the pull request into your fork&lt;/LI&gt;&lt;LI&gt;Change the package name&lt;/LI&gt;&lt;LI&gt;Load it as a JAR library&lt;/LI&gt;&lt;LI&gt;Access your custom reader using &lt;B&gt;spark.read.format("com.example.spark.datasources.csv.CSVFileFormat").load(filename)&lt;/B&gt;&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Option: Fix the original file...&lt;/P&gt;
&lt;P&gt;Create a program that reads the file in as a byte stream, repairs the bytes, and writes out a new, repaired file. It's easiest to do without parallelizing, but you could parallelize by breaking the file into 1 GB chunks and seeking ahead to the relevant section, processing the chunks in parallel.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 17:40:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29030#M20787</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2017-11-29T17:40:06Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29031#M20788</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;So I chased down all the calls in the latest version of the source code (with @Brian Clapper's help) and it boils down to how org.apache.hadoop.util.&lt;B&gt;LineReader &lt;/B&gt;is implemented - in short, it's hardcoded to use CR &amp;amp; LF if a specific record delimiter is not specified.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 18:46:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29031#M20788</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T18:46:12Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29032#M20789</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thank you Doug, and everyone else who took the time to reply. I really appreciate this help!&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 19:12:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29032#M20789</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T19:12:42Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29033#M20790</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;QQ: how do I call System.getProperty from a Python notebook in Databricks? Can't seem to figure out the correct import &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:03:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29033#M20790</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T23:03:51Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29034#M20791</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Yeah, you can't really do that easily. That's a JVM thing. The actual code that reads the CSV runs on the JVM (and is written in Scala).&lt;/P&gt;
&lt;P&gt;If you're working in Python, your best bet really is to write your own adapter (or wait until the PR shows up in Spark). And you'll have to write that adapter in Scala or Java.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:05:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29034#M20791</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-29T23:05:43Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29035#M20792</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;spark._jvm.java.lang.System.getProperty("line.separator")
spark._jvm.java.lang.System.setProperty("line.separator", "\n")&lt;/CODE&gt;&lt;/PRE&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:52:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29035#M20792</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2017-11-29T23:52:05Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29036#M20793</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;That's cool. I didn't know we could do that from Python.... However, remember that &lt;B&gt;org.apache.hadoop.util.LineReader&lt;/B&gt; is hardcoded to use CR &amp;amp; LFs - I just verified that this morning. It does not check any system properties like all "good" libraries would.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 00:00:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29036#M20793</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-30T00:00:33Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29037#M20794</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can do that, but you're stepping inside "implementation detail" land. This is definitely doable, but you're using a non-public API, so &lt;I&gt;caveat programmer.&lt;/I&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 01:31:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29037#M20794</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-30T01:31:12Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29038#M20795</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can use &lt;B&gt;newAPIHadoopFile&lt;/B&gt;:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;B&gt;SCALA&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "#")
val log_df = sc.newAPIHadoopFile("path/to/file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString).toDF()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;B&gt;PYTHON&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# In PySpark the Hadoop configuration is passed as a dict via the conf= keyword
conf = {"textinputformat.record.delimiter": "#"}
log_rdd = sc.newAPIHadoopFile("/path/to/file", "org.apache.hadoop.mapreduce.lib.input.TextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text", conf=conf).map(lambda x: x[1])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2018 22:46:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29038#M20795</guid>
      <dc:creator>DanielTomes</dc:creator>
      <dc:date>2018-03-07T22:46:50Z</dc:date>
    </item>
  </channel>
</rss>

