<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issues with UTF-16 files and unicode characters in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 12 Dec 2018 22:05:09 GMT</pubDate>
    <dc:creator>User16817872376</dc:creator>
    <dc:date>2018-12-12T22:05:09Z</dc:date>
    <item>
      <title>Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28355#M20175</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Can someone please offer some insight? I've spent days trying to solve this issue.&lt;/P&gt;
&lt;P&gt;We have the task of loading hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one, so our source data contains lots of unicode characters. Neither the encoding of the files nor the format can be changed.&lt;/P&gt;
&lt;P&gt;The issue I'm seeing quite frequently is that these unicode characters are not displayed correctly via the Spark interpreter; additionally, the problem causes the tab delimiter to be escaped, ultimately resulting in subsequent columns shifting to the left.&lt;/P&gt;
&lt;P&gt;A prime example is the euro symbol, U+20AC (€): the symbol displays fine when the file is opened in Notepad++, vi or pretty much any unicode-capable editor.&lt;/P&gt;
&lt;P&gt;However, when displayed in a dataframe I see "¬•". I thought this might be a problem with the way our application encodes files, but no: it seems to affect any UTF-16LE file created on Windows. I can reproduce this every single time simply by typing the euro symbol into Windows Notepad, saving the file with UTF-16 encoding, and loading it into Databricks.&lt;/P&gt;
&lt;P&gt;This is causing us real problems. Can anyone help?&lt;/P&gt;
&lt;P&gt;Sample code:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\\t")
  .option("endian", "little")
  .option("encoding", "UTF-16")
  .option("charset", "UTF-16")
  .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
  .option("codec", "gzip")
  .option("sep", "\t")
  .csv("mnt/adls/test/cu100.gz")
display(df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It somehow seems like it might be a problem with the csv connector, because:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val test = Seq("€")
val t = test.toDF
display(t)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;works absolutely fine.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Dec 2018 20:13:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28355#M20175</guid>
      <dc:creator>DominicRobinson</dc:creator>
      <dc:date>2018-12-11T20:13:13Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28356#M20176</link>
      <description>&lt;P&gt;hi @Dominic Robinson​&amp;nbsp; , my colleague tells me that the CSV source should support UTF-16LE and UTF-16BE, but not plain UTF-16. It may be helpful to look at the test suite for the CSV source - it has simple examples of what is and isn't possible. It seems like you are saying that should be covered by UTF-16LE - if so, you may want to verify that there isn't a discrepancy caused by creating the file in Windows. If I recall correctly, Windows formats text files slightly differently than Unix/Mac does.&lt;/P&gt;&lt;P&gt;Side note, you should not use "com.databricks.spark.csv" anymore. Spark has a built-in csv data source as of Spark 2.0 and the Databricks package is no longer updated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 00:19:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28356#M20176</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T00:19:46Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28357#M20177</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;It can't read a simple one-column text file containing the euro symbol. It doesn't seem to be a Windows encoding issue either, as I've reproduced it with a file written using vi on Fedora.&lt;/P&gt;
&lt;P&gt;Here is a very simple example file:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://codiad.dcrdev.com/workspace/Workbin/test1.txt" target="test_blank"&gt;https://codiad.dcrdev.com/workspace/Workbin/test1.txt&lt;/A&gt;&lt;/P&gt; 
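&lt;P&gt;For anyone who wants to reproduce it without downloading the file, a minimal UTF-16LE test file like this one can also be generated programmatically - a rough sketch (the path is illustrative; the leading bytes are the UTF-16LE byte-order mark that Windows editors typically write):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// UTF-16LE byte-order mark (0xFF 0xFE), followed by "€\n" encoded as UTF-16LE
val bytes = Array[Byte](0xFF.toByte, 0xFE.toByte) ++
  "€\n".getBytes(StandardCharsets.UTF_16LE)
Files.write(Paths.get("/tmp/test1.txt"), bytes)&lt;/CODE&gt;&lt;/PRE&gt;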
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 10:32:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28357#M20177</guid>
      <dc:creator>DominicRobinson</dc:creator>
      <dc:date>2018-12-12T10:32:45Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28358#M20178</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;hi @Dominic Robinson I'm unable to create a simple reproduction of this issue. I was able to write out a file with the Euro symbol as the column using dataframe.write.csv(path), and the symbol was fine when I read the file back in using spark.read.csv(path). I think you are correct that the problem is the interaction between the csv source and whatever is producing your files.&lt;/P&gt;
&lt;P&gt;Did you try this out with the built-in csv source yet?&lt;/P&gt;
&lt;P&gt;If you are continuing to have problems, please raise a support ticket with Databricks. It could be a bug, or it could be your particular use case is unsupported and could be added to the csv source by Databricks.&lt;/P&gt; 
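&lt;P&gt;For reference, the round trip I tried looked roughly like this (the path is illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Write the Euro symbol out with the built-in csv source, then read it back.
Seq("€").toDF("sym").write.mode("overwrite").csv("/tmp/euro_roundtrip")
val back = spark.read.csv("/tmp/euro_roundtrip")
display(back)&lt;/CODE&gt;&lt;/PRE&gt;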
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 22:04:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28358#M20178</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T22:04:04Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.&lt;/P&gt; 
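&lt;P&gt;One possible sketch of that approach (paths, the glob pattern and column names are illustrative, and this is untested against your files) is to read the raw bytes per file and decode them as UTF-16LE yourself, side-stepping the csv reader's charset handling entirely:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import java.nio.charset.StandardCharsets
import spark.implicits._

// Read each file as raw bytes, decode as UTF-16LE, then split into lines.
val lines = sc.binaryFiles("/mnt/adls/test/*.txt")
  .values
  .flatMap { stream =&gt;
    new String(stream.toArray, StandardCharsets.UTF_16LE).split("\r?\n")
  }

// Split each decoded line on the tab delimiter.
val df = lines.toDF("line")
  .selectExpr("split(line, '\t') AS cols")
display(df)&lt;/CODE&gt;&lt;/PRE&gt;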
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 22:05:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T22:05:09Z</dc:date>
    </item>
  </channel>
</rss>

