Issues with UTF-16 files and unicode characters

DominicRobinson
New Contributor II

Can someone please offer some insight - I've spent days trying to solve this issue

We have the task of loading hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one, and therefore our source contains lots of Unicode characters. The encoding of the files cannot be changed, nor can the format.

The issue I'm seeing quite frequently is that these Unicode characters are not displayed correctly via the Spark interpreter. Additionally, this problem causes the tab delimiter to be escaped, ultimately resulting in subsequent columns shifting to the left.

A prime example of this is the euro symbol, U+20AC (€). The symbol displays fine when opened in Notepad++, vi or pretty much any Unicode-capable editor.

However, when displayed in a dataframe I see "¬•". I thought this might be a problem with the way our application is encoding files, but no, it seems to extend to any UTF-16LE file encoded in Windows. I can reproduce this every single time by simply typing the euro symbol into Windows Notepad, saving the file with UTF-16 encoding and loading it into Databricks.

This is causing us real problems - can anyone help?

Sample code:

val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\\t")
  .option("endian", "little")
  .option("encoding", "UTF-16")
  .option("charset", "UTF-16")
  .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
  .option("codec", "gzip")
  .option("sep", "\t")
  .csv("mnt/adls/test/cu100.gz")
display(df)

It somehow seems like it might be a problem with the csv connector, because:

val test = Seq("€")
val t = test.toDF
display(t)

Works absolutely fine.

4 REPLIES

User16817872376
New Contributor III

hi @Dominic Robinson, my colleague tells me that the CSV source should support UTF-16LE and UTF-16BE, but not plain UTF-16. It may be helpful to look at the test suite for the CSV source - it has simple examples of what is and isn't possible. It seems like you are saying that should be covered by UTF-16LE - if so, you may want to verify that there isn't a discrepancy caused by creating the file in Windows. If I recall correctly, Windows formats text files slightly differently than Unix/Mac does.
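
To make the distinction between the three encoding names concrete, here is a minimal sketch that uses only the JDK's charset classes (no Spark involved). It encodes a euro sign followed by a tab as UTF-16LE and then decodes the same bytes three ways. The object and value names are illustrative, and it is an assumption, not a diagnosis, that a byte-order mismatch like this is what happens inside the CSV reader.

```scala
import java.nio.charset.StandardCharsets

object ByteOrderDemo {
  // "€" (U+20AC) followed by a tab, encoded as UTF-16LE without a byte order mark.
  val leBytes: Array[Byte] = "\u20AC\t".getBytes(StandardCharsets.UTF_16LE)

  // Render bytes as space-separated hex pairs for inspection.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  // Correct byte order: round-trips to the original text.
  val asLe: String = new String(leBytes, StandardCharsets.UTF_16LE)

  // Wrong byte order: every 16-bit code unit has its two bytes swapped.
  val asBe: String = new String(leBytes, StandardCharsets.UTF_16BE)

  // Plain UTF-16: the JVM looks for a byte order mark and assumes
  // big-endian when it finds none, so this matches asBe here.
  val asPlain: String = new String(leBytes, StandardCharsets.UTF_16)
}
```

Decoded with the wrong byte order, the euro sign becomes an unrelated character and the tab is no longer a tab, which would be consistent with columns running together. Note that the explicit `UTF-16LE` and `UTF-16BE` charsets on the JVM do not strip a leading byte order mark; it is kept as the character U+FEFF.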

Side note, you should not use "com.databricks.spark.csv" anymore. Spark has a built-in csv data source as of Spark 2.0 and the Databricks package is no longer updated.
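
For reference, a sketch of what the read might look like with the built-in source, assuming the files really are UTF-16LE. The option names `header`, `inferSchema`, `sep`, `encoding` and `multiLine` are standard options of Spark's CSV reader. Turning on `multiLine` is my assumption (it makes Spark read each file as one stream instead of splitting it into lines first), and whether UTF-16 decodes correctly may depend on the Spark version. Gzip is normally detected from the `.gz` extension, so no codec option is set. The object name is illustrative.

```scala
object Utf16TsvOptions {
  // Reader options for tab-separated UTF-16LE files; Spark takes option values as strings.
  val options: Map[String, String] = Map(
    "header"      -> "true",
    "inferSchema" -> "true",
    "sep"         -> "\t",
    "encoding"    -> "UTF-16LE", // explicit byte order instead of plain "UTF-16"
    "multiLine"   -> "true"      // assumption: avoids splitting the raw bytes on newlines
  )
}

// In a notebook where `spark` is the active SparkSession:
//   val df = spark.read.options(Utf16TsvOptions.options).csv("mnt/adls/test/cu100.gz")
//   display(df)
```

With the built-in source there is no `.format("com.databricks.spark.csv")` call, and `sep` replaces the duplicate `delimiter` setting.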

DominicRobinson
New Contributor II

It can't read a simple one-column text file with the euro symbol. It doesn't seem to be a Windows encoding issue either, as I've written a file using vi on Fedora.

Here is a very simple example file:

https://codiad.dcrdev.com/workspace/Workbin/test1.txt

User16817872376
New Contributor III

hi @Dominic Robinson, I'm unable to create a simple reproduction of this issue. I was able to write out a file with the euro symbol as the column using dataframe.write.csv(path), and the symbol was fine when I read the file back in using spark.read.csv(path). I think you are correct that the problem is the interaction between the csv source and whatever is producing your files.

Did you try this out with the built-in csv source yet?

If you are continuing to have problems, please raise a support ticket with Databricks. It could be a bug, or it could be your particular use case is unsupported and could be added to the csv source by Databricks.

User16817872376
New Contributor III

You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.
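
A minimal sketch of that decoding step, assuming each file is small enough to decode in one go. The hypothetical helper `decodeUtf16LeTsv` below (the name and the sample values are mine, not from the thread) takes the raw bytes of one file, decodes them as UTF-16LE with the JDK's charset support, drops a leading byte order mark if present, and splits on line breaks and tabs. It deliberately starts from raw bytes rather than from lines of text, because a reader that splits on a single newline byte can cut a UTF-16 file in the middle of a two-byte code unit.

```scala
import java.nio.charset.StandardCharsets

object Utf16Tsv {
  // Hypothetical helper: decode one whole file's bytes as UTF-16LE
  // and split the text into rows of tab-separated fields.
  def decodeUtf16LeTsv(bytes: Array[Byte]): Vector[Vector[String]] = {
    val text = new String(bytes, StandardCharsets.UTF_16LE)
    // A byte order mark survives decoding as U+FEFF; drop it if present.
    val body = if (text.startsWith("\uFEFF")) text.substring(1) else text
    body
      .split("\r?\n")                  // accept both CRLF and LF line endings
      .filter(_.nonEmpty)              // skip blank lines
      .map(_.split("\t", -1).toVector) // -1 keeps trailing empty fields
      .toVector
  }

  // Illustrative input: BOM, a header row and one data row with a euro sign.
  val sample: Array[Byte] =
    "\uFEFFprice\tcurrency\r\n100\t\u20AC\r\n".getBytes(StandardCharsets.UTF_16LE)
}
```

On a cluster, the bytes could come from something like `sc.binaryFiles`, and gzip-compressed inputs may first need to go through `java.util.zip.GZIPInputStream`. That wiring is left out here.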
