topic Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver in Data Engineering

How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan — Thu, 23 Sep 2021 00:42:40 GMT

Hello all, I'm trying to pull table data from databricks tables that contain foreign language characters in UTF-8 into an ETL tool using a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website.

The issue is that when the data comes over all of the foreign language and special characters are converted to junk characters. I have searched for any configuration setting for using unicode or UTF-8 with the JDBC url or config settings but couldn't find anything. The ODBC version of the Simba drive does have a property called "UseUnicodeSqlCharacterTypes" which if enabled the ODBC connector returns SQL_WVARCHAR for STRING and VARCHAR columns, and returns SQL_WCHAR for CHAR columns.

There doesn't seem to be anything I can do for the JDBC driver. Is there some other JDBC driver or some other method I can try to get the properly encoded unicode data over JDBC? Any help will be greatly appreciated. Thanks.

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

-werners- — Thu, 23 Sep 2021 07:19:37 GMT

AFAIK Databricks handles unicode well. It seems your ETL tool is not configured for UTF8?

We had the same issue copying data into a database. The cause was a non-unicode collation on the database.

Your ETL tool should recognize the string columns of the databricks tables as UTF8.

Maybe you can try to bypass the JDBC driver and use the parquet files directly instead of over the table interface?

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan — Thu, 23 Sep 2021 12:10:10 GMT

Hi Werners, the issue is not Databricks (all of the data looks fine and in proper encoding when I look at it there), the issue is the SIMBA JDBC Driver which by default appears to be bringing over columns of datatype STRING as SQL_VARCHAR instead of SQL_WVARCHAR, for this specific use case i need to use the table interface. Other JDBC drivers typically have some property you can set to tell it to use unicode and UTF8, shocked I can't find this for the SIMBA JDBC Driver which Databricks provides on the site.

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

-werners- — Thu, 23 Sep 2021 12:12:13 GMT

That is the reason I asked if you can bypass the jdbc driver by reading the parquet files directly. Is your ETL tool able to read parquet files written by Databricks?

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan — Thu, 23 Sep 2021 12:24:00 GMT

Yes the tool could read the parquet files but in this instance it would not be optimal do so as there can by multiple versions of the parquet organized in date_time_stamp sub folders. The Table is updated to use the latest version so I just have to reference the same table in my ETL routine. Otherwise I would have to programmatically figure out the latest version of the parquet to read. It could be done but not preferred. Especially if I want to make updates/changes to the Delta Table, for that I have to do it over the JDBC connection.

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

-werners- — Thu, 23 Sep 2021 12:42:03 GMT

Ok I see.

maybe you can pass the character encoding in the connection you create in spark,

like in here? This example is Oracle but it might work with the Simba driver too.

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan — Thu, 23 Sep 2021 14:35:39 GMT

Yeah I saw that same post earlier and tried adding those properties as jdbc url properties but it didn't work. I think each driver has its own implementation of url properties that you can use and they are just not there for the Simba Driver but available for the Oracle Driver you see in the post.

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Anonymous — Fri, 01 Oct 2021 08:56:27 GMT

Can you try setting

UseUnicodeSqlCharacterTypes=1

in the driver, and also make sure 'file.encoding' is set to UTF-8 in jvm and see if the issue still persists?

Re: How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan — Fri, 01 Oct 2021 16:11:53 GMT

Hello User,

I actually found the solution to this issue and it partially related to what you suggested.

Initially I did try the UseUnicodeSqlCharacterTypes=1 but that did not make a difference.

Ultimately I realized that the issue was with the JAVA system properties as you also suggested.

I had to update 2 properties:

file.encoding (like you suggested)

sun.jnu.encoding

Once I set both of those to UTF-8, everything was good.