<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issues with UTF-16 files and unicode characters in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 12 Dec 2018 22:05:09 GMT</pubDate>
    <dc:creator>User16817872376</dc:creator>
    <dc:date>2018-12-12T22:05:09Z</dc:date>
    <item>
      <title>Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28355#M20175</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Can someone please offer some insight? I've spent days trying to solve this issue.&lt;/P&gt;
&lt;P&gt;We have the task of loading hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is an international one, so our source data contains lots of unicode characters. Neither the encoding of the files nor the format can be changed.&lt;/P&gt;
&lt;P&gt;The issue I'm seeing quite frequently is that these unicode characters are not displayed correctly via the Spark interpreter; additionally, the problem causes the tab delimiter to be escaped, ultimately resulting in subsequent columns shifting to the left.&lt;/P&gt;
&lt;P&gt;A prime example is the euro symbol, U+20AC (€): the symbol displays fine when the file is opened in Notepad++, vi or pretty much any unicode-capable editor.&lt;/P&gt;
&lt;P&gt;However, when displayed in a dataframe I see "¬•". I thought this might be a problem with the way our application encodes files, but no: it seems to affect any UTF-16LE file created on Windows. I can reproduce this every single time simply by typing the euro symbol into Windows Notepad, saving the file with UTF-16 encoding, and loading it into Databricks.&lt;/P&gt;
&lt;P&gt;This is causing us real problems. Can anyone help?&lt;/P&gt;
&lt;P&gt;Sample code:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\\t")
  .option("endian", "little")
  .option("encoding", "UTF-16")
  .option("charset", "UTF-16")
  .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
  .option("codec", "gzip")
  .option("sep", "\t")
  .csv("mnt/adls/test/cu100.gz")
display(df)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;It somehow seems like it might be a problem with the csv connector, because:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;val test = Seq("€")
val t = test.toDF
display(t)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;works absolutely fine.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 Dec 2018 20:13:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28355#M20175</guid>
      <dc:creator>DominicRobinson</dc:creator>
      <dc:date>2018-12-11T20:13:13Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28356#M20176</link>
      <description>&lt;P&gt;hi @Dominic Robinson​&amp;nbsp; , my colleague tells me that the CSV source should support UTF-16LE and UTF-16BE, but not plain UTF-16. It may be helpful to look at the test suite for the CSV source - it has simple examples of what is and isn't possible. It seems like you are saying that should be covered by UTF-16LE - if so, you may want to verify that there isn't a discrepancy caused by creating the file in Windows. If I recall correctly, Windows formats text files slightly differently than Unix/Mac does.&lt;/P&gt;&lt;P&gt;Side note, you should not use "com.databricks.spark.csv" anymore. Spark has a built-in csv data source as of Spark 2.0 and the Databricks package is no longer updated.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 00:19:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28356#M20176</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T00:19:46Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28357#M20177</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;It can't read a simple one-column text file containing the euro symbol. It doesn't seem to be a Windows encoding issue either, as I've reproduced it with a file written using vi on Fedora.&lt;/P&gt;
&lt;P&gt;Here is a very simple example file:&lt;/P&gt;
&lt;P&gt;&lt;A href="https://codiad.dcrdev.com/workspace/Workbin/test1.txt" target="test_blank"&gt;https://codiad.dcrdev.com/workspace/Workbin/test1.txt&lt;/A&gt;&lt;/P&gt; 
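&lt;P&gt;For anyone who wants to reproduce it without downloading the file, a minimal UTF-16LE test file like this one can also be generated programmatically - a rough sketch (the path is illustrative; the leading bytes are the UTF-16LE byte-order mark that Windows editors typically write):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// UTF-16LE byte-order mark (0xFF 0xFE), followed by "€\n" encoded as UTF-16LE
val bytes = Array[Byte](0xFF.toByte, 0xFE.toByte) ++
  "€\n".getBytes(StandardCharsets.UTF_16LE)
Files.write(Paths.get("/tmp/test1.txt"), bytes)&lt;/CODE&gt;&lt;/PRE&gt;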
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 10:32:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28357#M20177</guid>
      <dc:creator>DominicRobinson</dc:creator>
      <dc:date>2018-12-12T10:32:45Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28358#M20178</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;hi @Dominic Robinson I'm unable to create a simple reproduction of this issue. I was able to write out a file with the Euro symbol as the column using dataframe.write.csv(path), and the symbol was fine when I read the file back in using spark.read.csv(path). I think you are correct that the problem is the interaction between the csv source and whatever is producing your files.&lt;/P&gt;
&lt;P&gt;Did you try this out with the built-in csv source yet?&lt;/P&gt;
&lt;P&gt;If you are continuing to have problems, please raise a support ticket with Databricks. It could be a bug, or it could be your particular use case is unsupported and could be added to the csv source by Databricks.&lt;/P&gt; 
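&lt;P&gt;For reference, the round trip I tried looked roughly like this (the path is illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;// Write the Euro symbol out with the built-in csv source, then read it back.
Seq("€").toDF("sym").write.mode("overwrite").csv("/tmp/euro_roundtrip")
val back = spark.read.csv("/tmp/euro_roundtrip")
display(back)&lt;/CODE&gt;&lt;/PRE&gt;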
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 22:04:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28358#M20178</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T22:04:04Z</dc:date>
    </item>
    <item>
      <title>Re: Issues with UTF-16 files and unicode characters</title>
      <link>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;You can also always read in the file as a textFile, and then run a UTF-16 decoder/encoder library as a UDF on the text.&lt;/P&gt; 
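&lt;P&gt;One possible sketch of that approach (paths, the glob pattern and column names are illustrative, and this is untested against your files) is to read the raw bytes per file and decode them as UTF-16LE yourself, side-stepping the csv reader's charset handling entirely:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import java.nio.charset.StandardCharsets
import spark.implicits._

// Read each file as raw bytes, decode as UTF-16LE, then split into lines.
val lines = sc.binaryFiles("/mnt/adls/test/*.txt")
  .values
  .flatMap { stream =&gt;
    new String(stream.toArray, StandardCharsets.UTF_16LE).split("\r?\n")
  }

// Split each decoded line on the tab delimiter.
val df = lines.toDF("line")
  .selectExpr("split(line, '\t') AS cols")
display(df)&lt;/CODE&gt;&lt;/PRE&gt;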
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Dec 2018 22:05:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issues-with-utf-16-files-and-unicode-characters/m-p/28359#M20179</guid>
      <dc:creator>User16817872376</dc:creator>
      <dc:date>2018-12-12T22:05:09Z</dc:date>
    </item>
  </channel>
</rss>

