<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Custom line separator in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll try to run this question by other instructors and/or engineers to see if there is some unknown/undocumented solution.&lt;/P&gt;</description>
    <pubDate>Wed, 29 Nov 2017 15:47:16 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-29T15:47:16Z</dc:date>
    <item>
      <title>Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29026#M20783</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I see that &lt;A href="https://github.com/apache/spark/pull/18581" target="_blank"&gt;https://github.com/apache/spark/pull/18581&lt;/A&gt; will enable defining custom line separators for many sources, including CSV. Apart from waiting for this PR to make it into the main Databricks runtime, is there any other way to support a different line separator, such as directly setting the HadoopFileLinesReader configuration?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 04:54:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29026#M20783</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T04:54:40Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29027#M20784</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Currently, the only known option is to fix the line separator before beginning your standard processing. &lt;/P&gt;&lt;P&gt;In that vein, one option I can think of is to use &lt;B&gt;SparkContext.wholeTextFiles(..)&lt;/B&gt; to read in an RDD, split the data by the customs line separator and then from there are a couple of additional choices:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Write the file back out with the new line separators.&lt;/LI&gt;&lt;LI&gt;Convert the RDD to a DataFrame with a call like &lt;B&gt;rdd.toDF()&lt;/B&gt; and resume processing.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The major disadvantage to this is the size of each file. The call to &lt;B&gt;wholeTextFiles(..) &lt;/B&gt;will load each file as a single partition which means we could very easily consume all the available RAM.&lt;/P&gt;&lt;P&gt;The second option I can think of would be to perform the same operation above, but from the driver with standard file IO (if possible). The Scala/Python libraries for file manipulation are pretty straight forward allowing you to read one chunk of data, clean it up and write it back out to a new file. Again this is a very ugly solution and is not as practical if you are working from most data stores like S3, HDFS, etc.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 15:46:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29027#M20784</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T15:46:24Z</dc:date>
    </item>
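The wholeTextFiles(..) workaround above boils down to one extra step: split each file's full text on the custom separator before normal CSV parsing takes over. A minimal plain-Python sketch of that splitting step follows; the "#" separator and the sample data are illustrative assumptions, and in a real job the same logic would run inside the RDD returned by wholeTextFiles(..):

```python
# Plain-Python sketch of the splitting step behind the wholeTextFiles(..)
# workaround: each file arrives as one big string, which we split on a
# custom record separator ("#" here, purely illustrative) before handing
# the records to a standard CSV parser.
import csv
import io

def split_records(whole_text, sep="#"):
    """Split one file's full text on a custom separator, dropping empties."""
    return [rec for rec in whole_text.split(sep) if rec.strip()]

# Example: three CSV records delimited by '#' instead of newlines.
raw = "1,alice#2,bob#3,carol"
records = split_records(raw)

# Re-join with '\n' so a standard CSV parser can take over -- this mirrors
# the "write the file back out with new line separators" option.
rows = list(csv.reader(io.StringIO("\n".join(records))))
print(rows)  # [['1', 'alice'], ['2', 'bob'], ['3', 'carol']]
```

In the real Spark job the same split would be applied per file, e.g. via flatMap over the (path, contents) pairs, followed by toDF() as described above.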
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'll try to run this question by other instructors and/or engineers to see if there is some unknown/undocumented solution.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 15:47:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29028#M20785</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T15:47:16Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29029#M20786</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;From the referenced PR, I assume that we’re talking about processing files that use something other than &lt;B&gt;\n&lt;/B&gt; to delimit lines—e.g., &lt;B&gt;\r&lt;/B&gt;, or &lt;B&gt;\r\n.&lt;/B&gt; Since CSV files are assumed to be text files, and since Java uses a platform-specific notion of a line separator (&lt;B&gt;System.getProperty("line.separator")&lt;/B&gt;), it &lt;I&gt;might&lt;/I&gt; be possible to change that system property. However, I’m not 100% positive that’ll work, without either (a) digging into the source for the CSV reader, or (b) experimenting with changing that property (which needs to be propagated to the Executors).&lt;/P&gt;
&lt;P&gt;I'd definitely stay away from &lt;B&gt;wholeTextFiles()&lt;/B&gt;, for the reasons Jacob Parr (@SireInsectus) mentions.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 16:38:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29029#M20786</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-29T16:38:53Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29030#M20787</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Brian's answer above is probably the simplest. Here are some other options in case &lt;B&gt;line.separator&lt;/B&gt; doesn't do the trick:&lt;/P&gt;
&lt;P&gt;Option: Custom CSV Reader&lt;/P&gt;
&lt;P&gt;Modify the CSV reader and upload it as a library. Here are the steps:&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;Fork the current CSV reader from &lt;A href="https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala" target="_blank"&gt;https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Merge the pull request into your fork&lt;/LI&gt;&lt;LI&gt;Change the package name&lt;/LI&gt;&lt;LI&gt;Load it as a JAR library&lt;/LI&gt;&lt;LI&gt;Access your custom reader using &lt;B&gt;spark.read.format("com.example.spark.datasources.csv.CSVFileFormat").load(filename)&lt;/B&gt;&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Option: Fix the original file...&lt;/P&gt;
&lt;P&gt;Create a program that reads the file in as a byte stream, repairs the bytes, and writes out a new, repaired file. It's easiest to do without parallelizing, but you could parallelize by breaking the file into 1 GB chunks and seeking ahead to the relevant section, processing the chunks in parallel.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 17:40:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29030#M20787</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2017-11-29T17:40:06Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29031#M20788</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;So I chased down all the calls in the latest version of the source code (with @Brian Clapper's help) and it boils down to how org.apache.hadoop.util.&lt;B&gt;LineReader &lt;/B&gt;is implemented - in short, it's hardcoded to use CR &amp;amp; LF if a specific record delimiter is not specified.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 18:46:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29031#M20788</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-29T18:46:12Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29032#M20789</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Thank you Doug, and everyone else who took the time to reply. I really appreciate this help!&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 19:12:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29032#M20789</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T19:12:42Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29033#M20790</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;QQ: how do I call System.getProperty from a Python notebook in Databricks? Can't seem to figure out the correct import &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:03:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29033#M20790</guid>
      <dc:creator>ArvindShyamsund</dc:creator>
      <dc:date>2017-11-29T23:03:51Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29034#M20791</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Yeah, you can't really do that easily. That's a JVM thing. The actual code that reads the CSV runs on the JVM (and is written in Scala).&lt;/P&gt;
&lt;P&gt;If you're working in Python, your best bet really is to write your own adapter (or wait until the PR shows up in Spark). And you'll have to write that adapter in Scala or Java.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:05:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29034#M20791</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-29T23:05:43Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29035#M20792</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;spark._jvm.java.lang.System.getProperty("line.separator")
spark._jvm.java.lang.System.setProperty("line.separator", "\n")&lt;/CODE&gt;&lt;/PRE&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2017 23:52:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29035#M20792</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2017-11-29T23:52:05Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29036#M20793</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;That's cool. I didn't know we could do that from Python.... However, remember that &lt;B&gt;org.apache.hadoop.util.LineReader&lt;/B&gt; is hardcoded to use CR &amp;amp; LFs - I just verified that this morning. It does not check any system properties like all "good" libraries would.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 00:00:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29036#M20793</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-30T00:00:33Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29037#M20794</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can do that, but you're stepping inside "implementation detail" land. This is definitely doable, but you're using a non-public API, so &lt;I&gt;caveat programmer.&lt;/I&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Nov 2017 01:31:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29037#M20794</guid>
      <dc:creator>User16844369513</dc:creator>
      <dc:date>2017-11-30T01:31:12Z</dc:date>
    </item>
    <item>
      <title>Re: Custom line separator</title>
      <link>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29038#M20795</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can use &lt;B&gt;newAPIHadoopFile&lt;/B&gt;:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;B&gt;SCALA&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "#")
val log_df = sc.newAPIHadoopFile("path/to/file", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString).toDF()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;B&gt;PYTHON&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;# In PySpark the Hadoop configuration is passed as a dict via the conf= keyword
conf = {"textinputformat.record.delimiter": "#"}
log_rdd = sc.newAPIHadoopFile("/path/to/file", "org.apache.hadoop.mapreduce.lib.input.TextInputFormat", "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text", conf=conf).map(lambda x: x[1])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Mar 2018 22:46:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/custom-line-separator/m-p/29038#M20795</guid>
      <dc:creator>DanielTomes</dc:creator>
      <dc:date>2018-03-07T22:46:50Z</dc:date>
    </item>
  </channel>
</rss>

