
How to properly load Unicode (UTF-8) characters from table over JDBC connection using Simba Spark Driver

Quan
New Contributor III

Hello all, I'm trying to pull table data from Databricks tables that contain foreign-language (UTF-8) characters into an ETL tool over a JDBC connection. I'm using the latest Simba Spark JDBC driver available from the Databricks website.

The issue is that when the data comes over, all of the foreign-language and special characters are converted to junk characters. I have searched for a configuration setting for Unicode or UTF-8 in the JDBC URL or driver settings but couldn't find anything. The ODBC version of the Simba driver does have a property called "UseUnicodeSqlCharacterTypes"; when it is enabled, the ODBC connector returns SQL_WVARCHAR for STRING and VARCHAR columns and SQL_WCHAR for CHAR columns.

There doesn't seem to be an equivalent for the JDBC driver. Is there another JDBC driver, or some other method I can try, to get properly encoded Unicode data over JDBC? Any help would be greatly appreciated. Thanks.
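
For context, a minimal sketch in Java of the kind of read described above, assuming the legacy Simba Spark JDBC URL format (jdbc:spark://...); the workspace host, HTTP path, token, and table/column names are placeholders, not values from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DatabricksUnicodeRead {
    public static void main(String[] args) throws Exception {
        // Placeholder URL in the legacy Simba Spark format; newer Databricks drivers use jdbc:databricks://
        String url = "jdbc:spark://<workspace-host>:443/default;"
                + "transportMode=http;ssl=1;"
                + "httpPath=<http-path>;"
                + "AuthMech=3;UID=token;PWD=<personal-access-token>";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT some_string_col FROM some_schema.some_table")) {
            while (rs.next()) {
                // Non-ASCII characters in this column are what come back garbled.
                System.out.println(rs.getString(1));
            }
        }
    }
}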


Kaniz_Fatma
Community Manager

Hi @Quan! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will follow up with my team and get back to you soon. Thanks.

-werners-
Esteemed Contributor III

AFAIK Databricks handles Unicode well. It seems your ETL tool is not configured for UTF-8?

We had the same issue copying data into a database. The cause was a non-Unicode collation on the database.

Your ETL tool should recognize the string columns of the Databricks tables as UTF-8.

Maybe you can try to bypass the JDBC driver and read the parquet files directly instead of going through the table interface?

Quan
New Contributor III

Hi Werners, the issue is not Databricks (the data looks fine and properly encoded when I look at it there); the issue is the Simba JDBC driver, which by default appears to bring over columns of data type STRING as SQL_VARCHAR instead of SQL_WVARCHAR. For this specific use case I need to use the table interface. Other JDBC drivers typically have some property you can set to tell them to use Unicode and UTF-8; I'm surprised I can't find this for the Simba JDBC driver that Databricks provides on its site.

-werners-
Esteemed Contributor III

That is the reason I asked if you can bypass the JDBC driver by reading the parquet files directly. Is your ETL tool able to read parquet files written by Databricks?

Quan
New Contributor III

Yes, the tool could read the parquet files, but in this instance it would not be optimal to do so, since there can be multiple versions of the parquet organized in date_time_stamp subfolders. The table is updated to use the latest version, so I just have to reference the same table in my ETL routine; otherwise I would have to programmatically figure out the latest version of the parquet to read. It could be done, but it is not preferred, especially since any updates/changes to the Delta table have to go over the JDBC connection.

-werners-
Esteemed Contributor III

OK, I see.

Maybe you can pass the character encoding in the connection you create in Spark, like in the example linked here? That example is for Oracle, but it might work with the Simba driver too.
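
A sketch of what that suggestion could look like from the JDBC side, assuming hypothetical property names (useUnicode, characterEncoding) borrowed from other vendors' drivers; as the next reply notes, these do not appear to be documented for the Simba Spark driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class EncodingViaProperties {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("UID", "token");
        props.setProperty("PWD", "<personal-access-token>");
        props.setProperty("useUnicode", "true");          // assumption: not a documented Simba property
        props.setProperty("characterEncoding", "UTF-8");  // assumption: not a documented Simba property

        String url = "jdbc:spark://<workspace-host>:443/default;"
                + "transportMode=http;ssl=1;httpPath=<http-path>;AuthMech=3";

        try (Connection conn = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}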

Quan
New Contributor III

Yeah, I saw that same post earlier and tried adding those properties as JDBC URL properties, but it didn't work. I think each driver has its own set of supported URL properties, and they just aren't there for the Simba driver, even though they are available for the Oracle driver in that post.

Anonymous
Not applicable

Can you try setting UseUnicodeSqlCharacterTypes=1 in the driver, and also make sure 'file.encoding' is set to UTF-8 in the JVM, and see if the issue still persists?
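
A hedged sketch of that suggestion: the driver property appended to the JDBC URL (assuming the Simba driver accepts semicolon-separated key=value pairs there, as its ODBC counterpart does in the connection string) and file.encoding supplied when the ETL tool's JVM starts.

// Launch the JVM that hosts the driver with the encoding already set, for example:
//   java -Dfile.encoding=UTF-8 -jar <etl-tool>.jar
import java.sql.Connection;
import java.sql.DriverManager;

public class UseUnicodeAttempt {
    public static void main(String[] args) throws Exception {
        // Assumption: UseUnicodeSqlCharacterTypes=1 is passed like other Simba settings,
        // as a ;key=value pair in the URL. The follow-up below reports this alone did not help.
        String url = "jdbc:spark://<workspace-host>:443/default;"
                + "transportMode=http;ssl=1;httpPath=<http-path>;"
                + "AuthMech=3;UID=token;PWD=<personal-access-token>;"
                + "UseUnicodeSqlCharacterTypes=1";

        try (Connection conn = DriverManager.getConnection(url)) {
            System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        }
    }
}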

Quan
New Contributor III

Hello User,

I actually found the solution to this issue, and it is partially related to what you suggested.

Initially I did try UseUnicodeSqlCharacterTypes=1, but that did not make a difference.

Ultimately I realized that the issue was with the Java system properties, as you also suggested. I had to update two properties:

file.encoding (like you suggested)

sun.jnu.encoding

Once I set both of those to UTF-8, everything was good.
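
For anyone landing here later, a small check for those two properties; note that file.encoding and sun.jnu.encoding generally have to be supplied at JVM startup (for example via -D flags or JAVA_TOOL_OPTIONS for the ETL tool's JVM), since changing them after startup does not reliably change the default charset.

// Start the JVM that hosts the ETL tool / JDBC driver with, for example:
//   java -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8 -jar <etl-tool>.jar
// or: export JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset  = " + Charset.defaultCharset());
    }
}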
