<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14785#M9226</link>
    <description>&lt;P&gt;OK, I see.&lt;/P&gt;&lt;P&gt;Maybe you can pass the character encoding in the connection you create in Spark,&lt;/P&gt;&lt;P&gt;like &lt;A href="https://stackoverflow.com/questions/59052345/how-to-fix-encoding-problem-with-spark-jdbc" alt="https://stackoverflow.com/questions/59052345/how-to-fix-encoding-problem-with-spark-jdbc" target="_blank"&gt;in here&lt;/A&gt;? That example is for Oracle, but it might work with the Simba driver too.&lt;/P&gt;</description>
    <pubDate>Thu, 23 Sep 2021 12:42:03 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-09-23T12:42:03Z</dc:date>
    <item>
      <title>How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14779#M9220</link>
      <description>&lt;P&gt;Hello all, I'm trying to pull data from Databricks tables that contain foreign-language characters in UTF-8 into an ETL tool using a JDBC connection. I'm using the latest &lt;B&gt;Simba Spark JDBC driver&lt;/B&gt; available from the Databricks website.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The issue is that when the data comes over, all of the foreign-language and special characters are converted to junk characters. I have searched for a configuration setting for &lt;B&gt;Unicode&lt;/B&gt; or &lt;B&gt;UTF-8&lt;/B&gt; in the JDBC URL or config settings but couldn't find anything. The ODBC version of the Simba driver does have a property called "&lt;B&gt;UseUnicodeSqlCharacterTypes&lt;/B&gt;"; when it is enabled, the ODBC connector returns SQL_WVARCHAR for&amp;nbsp;STRING and&amp;nbsp;VARCHAR columns, and SQL_WCHAR for CHAR columns.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;There doesn't seem to be anything I can do for the JDBC driver. Is there some other JDBC driver or some other method I can try to get properly encoded Unicode data over JDBC? Any help will be greatly appreciated. Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 00:42:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14779#M9220</guid>
      <dc:creator>Quan</dc:creator>
      <dc:date>2021-09-23T00:42:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14781#M9222</link>
      <description>&lt;P&gt;AFAIK Databricks handles Unicode well. Could it be that your ETL tool is not configured for UTF-8?&lt;/P&gt;&lt;P&gt;We had the same issue copying data into a database. The cause was a non-Unicode collation on the database.&lt;/P&gt;&lt;P&gt;Your ETL tool should recognize the string columns of the Databricks tables as UTF-8.&lt;/P&gt;&lt;P&gt;Maybe you can try to bypass the JDBC driver and read the Parquet files directly instead of going through the table interface?&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 07:19:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14781#M9222</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-09-23T07:19:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14782#M9223</link>
      <description>&lt;P&gt;Hi Werners, the issue is not Databricks (all of the data looks fine and properly encoded when I look at it there). The issue is the Simba JDBC driver, which by default appears to bring over columns of datatype STRING as SQL_VARCHAR instead of SQL_WVARCHAR. For this specific use case I need to use the table interface. Other JDBC drivers typically have some property you can set to tell them to use Unicode and UTF-8; I'm shocked I can't find one for the Simba JDBC driver that Databricks provides on its site.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 12:10:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14782#M9223</guid>
      <dc:creator>Quan</dc:creator>
      <dc:date>2021-09-23T12:10:10Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14783#M9224</link>
      <description>&lt;P&gt;That is the reason I asked if you can bypass the JDBC driver by reading the Parquet files directly. Is your ETL tool able to read Parquet files written by Databricks?&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 12:12:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14783#M9224</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-09-23T12:12:13Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14784#M9225</link>
      <description>&lt;P&gt;Yes, the tool could read the Parquet files, but in this instance it would not be optimal to do so, as there can be multiple versions of the Parquet data organized in date_time_stamp subfolders. The table is updated to use the latest version, so I just have to reference the same table in my ETL routine. Otherwise I would have to programmatically figure out the latest version of the Parquet data to read. It could be done, but it is not preferred, especially if I want to make updates or changes to the Delta table; for that I have to go over the JDBC connection.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 12:24:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14784#M9225</guid>
      <dc:creator>Quan</dc:creator>
      <dc:date>2021-09-23T12:24:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14785#M9226</link>
      <description>&lt;P&gt;OK, I see.&lt;/P&gt;&lt;P&gt;Maybe you can pass the character encoding in the connection you create in Spark,&lt;/P&gt;&lt;P&gt;like &lt;A href="https://stackoverflow.com/questions/59052345/how-to-fix-encoding-problem-with-spark-jdbc" alt="https://stackoverflow.com/questions/59052345/how-to-fix-encoding-problem-with-spark-jdbc" target="_blank"&gt;in here&lt;/A&gt;? That example is for Oracle, but it might work with the Simba driver too.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 12:42:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14785#M9226</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-09-23T12:42:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14786#M9227</link>
      <description>&lt;P&gt;Yeah, I saw that same post earlier and tried adding those properties as JDBC URL properties, but it didn't work. I think each driver has its own set of URL properties, and the ones shown for the Oracle driver in that post are just not available for the Simba driver.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 14:35:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14786#M9227</guid>
      <dc:creator>Quan</dc:creator>
      <dc:date>2021-09-23T14:35:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14787#M9228</link>
      <description>&lt;P&gt;Can you try setting&lt;/P&gt;&lt;P&gt;UseUnicodeSqlCharacterTypes=1&lt;/P&gt;&lt;P&gt;in the driver, also make sure 'file.encoding' is set to UTF-8 in the JVM, and see if the issue still persists?&lt;/P&gt;</description>
      <pubDate>Fri, 01 Oct 2021 08:56:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14787#M9228</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-10-01T08:56:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14788#M9229</link>
      <description>&lt;P&gt;Hello User,&lt;/P&gt;&lt;P&gt;I actually found the solution to this issue, and it is partially related to what you suggested.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Initially I did try UseUnicodeSqlCharacterTypes=1, but that did not make a difference.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Ultimately I realized that the issue was with the Java system properties, as you also suggested.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I had to update two properties:&lt;/P&gt;&lt;P&gt;file.encoding (like you suggested)&lt;/P&gt;&lt;P&gt;sun.jnu.encoding&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Once I set both of those to UTF-8, everything was good.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Oct 2021 16:11:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-properly-load-unicode-utf-8-characters-from-table-over/m-p/14788#M9229</guid>
      <dc:creator>Quan</dc:creator>
      <dc:date>2021-10-01T16:11:53Z</dc:date>
    </item>
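<!--
Editorial sketch, not part of the thread. The fix reported in the last post was to set the Java system properties file.encoding and sun.jnu.encoding to UTF-8. The thread does not say how they were set; one common way, assumed here, is to pass -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 when the JVM that hosts the ETL tool starts (where those options go is tool-specific). The small Java program below is a hypothetical check, with the class name EncodingCheck and the sample string chosen only for illustration. It reports what the running JVM currently uses and shows the kind of junk characters that appear when UTF-8 bytes are decoded with a non-UTF-8 charset. ISO-8859-1 stands in for the wrong charset; the actual default in the poster's environment is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the thread: inspects the JVM encoding settings
// and reproduces mojibake from a charset mismatch.
class EncodingCheck {

    // Encode the text as UTF-8, then decode those bytes with the given charset,
    // as a client whose default charset is not UTF-8 might do.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    // Report the two system properties named in the thread plus the
    // default charset the JVM actually resolved.
    static String describeJvmEncoding() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding")
                + ", defaultCharset=" + Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println(describeJvmEncoding());

        String sample = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Decoding with UTF-8 round-trips the text unchanged.
        System.out.println(decodeUtf8BytesAs(sample, StandardCharsets.UTF_8));

        // Decoding with ISO-8859-1 turns the two-byte e-acute into two
        // separate characters (U+00C3 followed by U+00A9).
        System.out.println(decodeUtf8BytesAs(sample, StandardCharsets.ISO_8859_1));
    }
}
```

Running this inside the same JVM configuration as the ETL tool, before and after adding the two -D options, should show whether the settings took effect. If defaultCharset is still not UTF-8 afterwards, the options are probably not reaching that JVM.
-->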
  </channel>
</rss>

