Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks JDBC SQL Warehouse Encoding Issue

crankerkor
New Contributor III

Hi Everyone.

I am trying to connect to and read data from a Databricks table using a SQL Warehouse, and return the data through an Azure API.

However, non-English characters, for example 'Ä', appear in the response as ��.

I am using the latest version of the databricks-jdbc driver.

I have tried to resolve it by setting the System properties as:

System.setProperty("file.encoding", "UTF-8");
System.setProperty("sun.jnu.encoding", "UTF-8");

I also tried adding the following to the connection string:

useUnicode=true;characterEncoding=UTF-8

However, this causes the exception:

Internal Server Error: [Databricks][DatabricksJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, infoMessages:[*org.apache.hive.service.cli.HiveSQLException:Configuration useUnicode is not available

 

1 ACCEPTED SOLUTION


Hi @Kaniz_Fatma
I was able to resolve the issue by changing my approach: instead of setting the system properties from the code at the start of execution, I now propagate them to the Azure Function environment variables via JAVA_OPTS. This way the JVM is instantiated with the proper configuration from the start.
Thanks a lot.
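For anyone hitting the same thing, a minimal sketch of that setup. The app setting value and the check below are illustrative only, assuming an Azure Functions Java worker that honors JAVA_OPTS:

// Azure Function app setting (illustrative):
//   JAVA_OPTS = -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
// With these flags the JVM starts with UTF-8 defaults; the check below can be
// run inside the function to confirm the settings actually took effect.
import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset  = " + Charset.defaultCharset());
    }
}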


4 REPLIES

Kaniz_Fatma
Community Manager

Hi @crankerkor

  1. JDBC Driver Configuration:

    • Ensure that you are using the correct JDBC driver. You mentioned using the databricks-jdbc driver. Make sure it’s the latest version and compatible with your Databricks cluster.
    • The Simba Spark JDBC driver is commonly used for Databricks connections. If you’re using a different one, consider switching to Simba.
  2. System Properties:

    • You’re on the right track by setting system properties. However, let’s refine it:
      • Set the file.encoding and sun.jnu.encoding properties to UTF-8.
      • Make sure these properties are set before creating the JDBC connection.
      • Example in Java:
        System.setProperty("file.encoding", "UTF-8");
        System.setProperty("sun.jnu.encoding", "UTF-8");
        
  3. JDBC Connection String:

    • Your connection string should include the following options:
      • useUnicode=true: This ensures that Unicode characters are handled correctly.
      • characterEncoding=UTF-8: Specifies the character encoding.
    • Example connection string (see the fuller Java sketch after this list):
      jdbc:databricks://<hostname>:443/default;useUnicode=true;characterEncoding=UTF-8
      
  4. Charset Auto-Detection:

    • Sometimes, charset auto-detection can cause issues. To disable it, explicitly set the charset using the encoding option:
      %scala
      option("encoding", "UTF-16LE")
      
  5. Database Collation:

    • Ensure that your database collation is set to UTF-8. If it’s not, you might encounter issues with character encoding.
    • Check the collation settings in your database management system (e.g., MySQL, PostgreSQL).
  6. Parquet Files:

    • If possible, consider using Parquet files directly instead of going through the JDBC driver. Parquet files handle Unicode characters well.
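Putting steps 2 and 3 together, here is a rough end-to-end Java sketch. It is not verified against a live warehouse: <hostname>, the HTTP path, the personal access token, and the table name my_table are placeholders, and the useUnicode/characterEncoding options are shown commented out because the warehouse in the original question rejected them.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DatabricksJdbcRead {
    public static void main(String[] args) throws Exception {
        // Step 2: set the encoding properties before the connection is created
        // (or, per the accepted solution, pass them as -D flags in JAVA_OPTS
        // so the JVM already starts with the right defaults).
        System.setProperty("file.encoding", "UTF-8");
        System.setProperty("sun.jnu.encoding", "UTF-8");

        // Step 3: standard Databricks JDBC URL; all angle-bracket values are placeholders.
        String url = "jdbc:databricks://<hostname>:443/default;"
                + "transportMode=http;ssl=1;AuthMech=3;"
                + "httpPath=<sql-warehouse-http-path>;"
                + "UID=token;PWD=<personal-access-token>";
        // Appending ";useUnicode=true;characterEncoding=UTF-8" here was rejected
        // with error 500051 ("Configuration useUnicode is not available") in the original question.

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name FROM my_table LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1)); // 'Ä' and similar should round-trip intact under UTF-8
            }
        }
    }
}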

Remember that handling character encoding involves coordination between your ETL tool, Databricks, and the JDBC driver. If you’ve tried all the steps above and still face issues, consider reaching out to Databricks support or community forums for further assistance.

References:

  1. Failure to detect encoding in JSON - Databricks
  2. Community Discussion: How to properly load Unicode (UTF-8) characters from table over JDBC
 


151640
New Contributor III

If Databricks support or product managers follow this forum, I suggest reviewing the Simba-provided docs: they do not discuss the name-value pairs mentioned above regarding UTF and encoding.
https://www.databricks.com/spark/jdbc-drivers-download

There are other gaps in the Simba docs regarding name-value pairs, including PreparedMetadataLimitZero.

151640
New Contributor III

If Databricks support or Product Management is following this forum, note that the Simba PDF for driver version 2.6.28 does not document the name-value pairs in the above solution.

Other omissions include PreparedMetadataLimitZero.
