Can someone please offer some insight? I've spent days trying to solve this issue.
We have the task of loading hundreds of tab-separated text files encoded in UTF-16 little endian. Our organisation is international, so our source data contains many Unicode characters. Neither the encoding nor the format of the files can be changed.
The issue I see quite frequently is that these Unicode characters are not displayed correctly by the Spark interpreter. The problem also causes the tab delimiter to be escaped, so subsequent columns shift to the left.
A prime example is the euro symbol, U+20AC €. The symbol displays fine when the file is opened in Notepad++, vi or pretty much any Unicode-capable editor.
However, when displayed in a dataframe I see ""¥". I thought this might be a problem with the way our application encodes files, but it seems to extend to any UTF-16LE file encoded on Windows. I can reproduce it every single time by typing the euro symbol into Windows Notepad, saving the file with UTF-16 encoding and loading it into Databricks.
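One way to check, outside Spark, whether byte order alone could explain both the wrong character and the vanishing tab is sketched below. It is only an illustration under assumptions: the object name EuroByteOrderCheck is made up for this example, and it says nothing about what the CSV reader actually does internally. It relies on the documented JVM behaviour that the generic "UTF-16" charset decodes as big endian when no byte-order mark is present.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical name, used only for this illustration.
object EuroByteOrderCheck {
  // The euro sign (U+20AC) followed by a tab, encoded as UTF-16LE without a BOM.
  val bytes: Array[Byte] = "\u20AC\t".getBytes(StandardCharsets.UTF_16LE)

  // Hex dump of the raw bytes as they would sit in the file.
  val hex: String = bytes.map(b => f"${b & 0xff}%02X").mkString(" ")

  // Decoding with the intended byte order gives back the euro sign and the tab.
  val asLittleEndian: String = new String(bytes, StandardCharsets.UTF_16LE)

  // The generic UTF-16 decoder falls back to big endian when there is no BOM,
  // so the same four bytes become two unrelated characters and no tab survives.
  val asGenericUtf16: String = new String(bytes, StandardCharsets.UTF_16)
}
```

If the second decode is what happens to any part of the file, the delimiter bytes are absorbed into other characters, which would look like columns shifting left.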
This is causing us real problems - can anyone help?
Sample code:
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\\t")
  .option("endian", "little")
  .option("encoding", "UTF-16")
  .option("charset", "UTF-16")
  .option("timestampFormat", "yyyy-MM-dd hh:mm:ss")
  .option("codec", "gzip")
  .option("sep", "\t")
  .csv("mnt/adls/test/cu100.gz")
display(df)
It somehow seems like it might be a problem with the CSV connector, because:
val test = Seq("€")
val t = test.toDF
display(t)
Works absolutely fine.
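To separate decoding from CSV parsing further, the file contents could be decoded by hand, bypassing the connector entirely. The following is a minimal sketch, not a confirmed fix: Utf16LeDecode and readRows are hypothetical names, the input is assumed to be a gzip stream of UTF-16LE text with an optional leading BOM, and how the stream is obtained from the mounted storage is left out.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import scala.io.{Codec, Source}

// Hypothetical helper, not part of Spark or Databricks.
object Utf16LeDecode {
  // Gunzips the stream, decodes it explicitly as UTF-16LE and splits it into tab-separated rows.
  def readRows(in: InputStream): List[Array[String]] = {
    val source = Source.fromInputStream(new GZIPInputStream(in))(Codec(StandardCharsets.UTF_16LE))
    try {
      // The UTF-16LE decoder keeps a leading BOM as U+FEFF, so drop it by hand.
      val text = source.mkString.stripPrefix("\uFEFF")
      // Accept both Windows (CRLF) and Unix (LF) line endings; keep empty trailing fields.
      text.split("\r?\n").toList.map(_.split("\t", -1))
    } finally source.close()
  }
}
```

If the rows come out correctly here, with the euro sign intact and the expected number of fields, that would point at the way the connector decodes the file and not at the file itself.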