How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan
New Contributor III

Hello all, I'm trying to pull data from Databricks tables that contain foreign-language characters in UTF-8 into an ETL tool over a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website.

The issue is that when the data comes over, all of the foreign-language and special characters are converted to junk characters. I have searched for a configuration setting for Unicode or UTF-8 in the JDBC URL or connection settings but couldn't find anything. The ODBC version of the Simba driver does have a property called "UseUnicodeSqlCharacterTypes": when it is enabled, the ODBC connector returns SQL_WVARCHAR for STRING and VARCHAR columns and SQL_WCHAR for CHAR columns.

There doesn't seem to be an equivalent for the JDBC driver. Is there some other JDBC driver, or some other method, I can try to get properly encoded Unicode data over JDBC? Any help would be greatly appreciated. Thanks.
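For reference, here is a minimal sketch of the read path I'm describing (the URL format follows the Simba Spark JDBC driver documentation; the host, HTTP path, token, and table/column names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class UnicodeReadCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders for workspace host, HTTP path, and personal access token.
        String url = "jdbc:spark://<workspace-host>:443/default;"
                + "transportMode=http;ssl=1;AuthMech=3;"
                + "httpPath=<http-path>;UID=token;PWD=<personal-access-token>";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT some_string_column FROM some_table LIMIT 10")) {
            while (rs.next()) {
                // Non-ASCII characters come back as junk here when the JVM's
                // default charset is not UTF-8.
                System.out.println(rs.getString(1));
            }
        }
    }
}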


9 REPLIES

Kaniz
Community Manager

Hi @Quan! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will follow up with my team and get back to you soon. Thanks.

-werners-
Esteemed Contributor III

AFAIK Databricks handles Unicode well. Could it be that your ETL tool is not configured for UTF-8?

We had the same issue copying data into a database. The cause was a non-unicode collation on the database.

Your ETL tool should recognize the string columns of the Databricks tables as UTF-8.

Maybe you can try to bypass the JDBC driver and read the Parquet files directly instead of going through the table interface?

Quan
New Contributor III

Hi Werners, the issue is not Databricks (all of the data looks fine and properly encoded when I view it there); the issue is the Simba JDBC driver, which by default appears to bring over columns of type STRING as SQL_VARCHAR instead of SQL_WVARCHAR. For this specific use case I need to use the table interface. Other JDBC drivers typically have a property you can set to tell them to use Unicode and UTF-8; I'm surprised I can't find one for the Simba JDBC driver that Databricks provides on its site.

-werners-
Esteemed Contributor III

That is the reason I asked whether you can bypass the JDBC driver by reading the Parquet files directly. Is your ETL tool able to read Parquet files written by Databricks?

Quan
New Contributor III

Yes, the tool can read the Parquet files, but in this instance that would not be optimal, as there can be multiple versions of the Parquet files organized in date_time_stamp subfolders. The table is updated to point at the latest version, so I just have to reference the same table in my ETL routine; otherwise I would have to programmatically figure out the latest version of the Parquet files to read. It could be done, but it's not preferred, especially since any updates or changes I want to make to the Delta table have to go over the JDBC connection.

-werners-
Esteemed Contributor III

OK, I see.

Maybe you can pass the character encoding in the connection you create in Spark, like in this example? The example is for Oracle, but it might work with the Simba driver too.

Quan
New Contributor III

Yeah, I saw that same post earlier and tried adding those properties as JDBC URL properties, but it didn't work. I think each driver has its own set of URL properties it accepts; they are just not there for the Simba driver, while they are available for the Oracle driver in that post.
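For completeness, this is roughly the pattern I tried. The exact keys are driver-specific, so the ones below are only illustrative examples of encoding properties that other drivers accept:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

Properties props = new Properties();
props.put("UID", "token");
props.put("PWD", "<personal-access-token>");
// Illustrative encoding hints from other drivers; the Simba driver appears
// to ignore keys it does not recognize:
props.put("characterEncoding", "UTF-8");       // MySQL Connector/J style
props.put("oracle.jdbc.defaultNChar", "true"); // Oracle style

Connection conn = DriverManager.getConnection(
        "jdbc:spark://<workspace-host>:443/default;transportMode=http;"
        + "ssl=1;AuthMech=3;httpPath=<http-path>", props);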

Anonymous
Not applicable

Can you try setting UseUnicodeSqlCharacterTypes=1 in the driver, and also make sure 'file.encoding' is set to UTF-8 in the JVM, and see if the issue still persists?
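Something like this, assuming the ODBC-style flag is accepted as a URL property (that part is a guess) and the JVM is launched with the encoding flag:

// Hypothetical: the ODBC property appended to the JDBC URL.
String url = "jdbc:spark://<workspace-host>:443/default;"
        + "transportMode=http;ssl=1;AuthMech=3;httpPath=<http-path>;"
        + "UID=token;PWD=<personal-access-token>;"
        + "UseUnicodeSqlCharacterTypes=1";

// file.encoding has to be set when the JVM starts, e.g.:
//   java -Dfile.encoding=UTF-8 -cp app.jar UnicodeReadCheck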

Quan
New Contributor III
(Accepted solution)

Hello User,

I actually found the solution to this issue, and it is partially related to what you suggested.

Initially I did try UseUnicodeSqlCharacterTypes=1, but that did not make a difference.

Ultimately I realized that the issue was with the Java system properties, as you also suggested. I had to update two properties:

file.encoding (as you suggested)

sun.jnu.encoding

Once I set both of those to UTF-8, everything was good.
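Both properties are read when the JVM starts, so setting them with System.setProperty() after startup is too late for file.encoding. A sketch of how to set them (the jar name is a placeholder; JAVA_TOOL_OPTIONS is an alternative when you can't edit the tool's launch scripts):

// Set at JVM launch, not at runtime:
//   java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar <etl-tool>.jar
// or, without touching the tool's startup scripts:
//   export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"

// Quick sanity check from inside the running JVM:
System.out.println(System.getProperty("file.encoding"));    // expect UTF-8
System.out.println(System.getProperty("sun.jnu.encoding")); // expect UTF-8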
