
UTF-8 troubles in DLT

EirikMa
New Contributor II

Issues with UTF-8 in DLT

I am having issues with UTF-8 in DLT:

[screenshot attachment: EirikMa_0-1711360526822.png]

I have tried to set the Spark config on the cluster running the DLT pipeline:

[screenshot attachment: EirikMa_1-1711361452104.png]
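
In case it helps, this is roughly how that config can be expressed in the DLT pipeline settings JSON; a minimal sketch, assuming the standard "clusters" / "spark_conf" layout (the values mirror what I tried):

{
  "clusters": [
    {
      "label": "default",
      "spark_conf": {
        "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",
        "spark.executor.extraJavaOptions": "-Dfile.encoding=UTF-8"
      }
    }
  ]
}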

 

I have fixed this on normal compute under the advanced settings, like this:

# Force UTF-8 as the JVM default charset on both the driver and the executors
spark.conf.set("spark.driver.extraJavaOptions", "-Dfile.encoding=UTF-8")
spark.conf.set("spark.executor.extraJavaOptions", "-Dfile.encoding=UTF-8")

However, this does not work with DLT. Have any of you guys figured this out?

- Eirik

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @EirikMa, it appears you’re encountering issues related to UTF-8 encoding in Delta Live Tables (DLT). While you’ve successfully resolved this in a regular Spark compute environment, it’s not working as expected within DLT.

Let’s explore some potential solutions and insights:

  1. DLT and the _metadata Column:

    • You mentioned accessing the _metadata column using cloudFiles in DLT. However, selecting this column in a DLT task isn’t yielding the expected results.
    • The code snippet you provided reads data from AWS S3 using cloudFiles and specifies the encoding as UTF-8 (see the sketch after this list). The issue might be related to DLT-specific behaviour.
    • Keep in mind that DLT operates within its own managed context, and certain features may behave differently than on standalone clusters.
  2. Databricks Runtime (DBR) Version:

    • You mentioned using Databricks Runtime 10.5; on a standalone cluster (outside of the DLT pipeline), this feature works as expected.
    • Within DLT, however, you cannot explicitly choose the runtime version (i.e., set spark_version) in the pipeline settings.
    • DLT currently runs on runtime 10.3, and feature availability depends on the runtime version.
  3. DLT Preview Channel (DBR 11.0):

    • In a recent update, DLT introduced a preview channel based on Databricks Runtime 11.0.
    • Databricks doesn’t recommend the preview channel for production workloads, but you could test the _metadata column feature there.
    • Treat it as an environment for exploration and testing, and exercise caution if you decide to rely on it in production.
  4. File Metadata Retrieval:

    • If accessing file metadata (such as modification timestamps) is crucial for your use case, consider alternative approaches.
    • You might explore other ways to retrieve metadata directly from the file system or use additional tools within DLT.
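
For illustration, here is a minimal sketch of a DLT table that reads CSV from S3 via Auto Loader with an explicit UTF-8 encoding and selects file metadata; the bucket path, table name, and column alias below are placeholders, not taken from your pipeline:

import dlt
from pyspark.sql.functions import col

@dlt.table(name="bronze_bokforing")  # placeholder (ASCII) table name
def bronze_bokforing():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("encoding", "UTF-8")   # decode the CSV bytes as UTF-8
        .option("header", "true")
        .load("s3://your-bucket/bokforing/")  # placeholder path
        # _metadata is the hidden per-file metadata column exposed by the reader
        .select("*", col("_metadata.file_modification_time").alias("file_mtime"))
    )

If you want to try the preview channel mentioned above, the pipeline settings also accept "channel": "PREVIEW" alongside the other top-level keys.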

Remember that DLT’s behaviour might differ from standalone clusters, and it’s essential to adapt your approach accordingly. If possible, test the _metadata column feature in DLT’s preview channel and assess its suitability for your use case.

Feel free to share any additional details or specific challenges you’re facing, and we can dive deeper into finding a solution! 🚀

 

EirikMa
New Contributor II

Hi @Kaniz_Fatma

Sorry for the long wait...

The problem is not the columns or the data itself; the UTF-8 option for CSV is working fine. The issue seems to be that the table names are not compatible. If I run the query through Auto Loader outside DLT and use backticks for the catalog_name, schema_name, and table_name, like this: `dev`.`bronze`.`bokføring`, it works perfectly.
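
For reference, the working query outside DLT looks roughly like this (the bucket, schema, and checkpoint paths are placeholders from my side):

# Auto Loader stream written to a backticked, non-ASCII table name (works outside DLT)
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("encoding", "UTF-8")
    .option("cloudFiles.schemaLocation", "s3://your-bucket/_schemas/bokforing")  # placeholder
    .load("s3://your-bucket/bokforing/")  # placeholder
    .writeStream
    .option("checkpointLocation", "s3://your-bucket/_checkpoints/bokforing")  # placeholder
    .toTable("`dev`.`bronze`.`bokføring`"))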

Is there any way this can be done in DLT? Do you know the timeline for when the runtime will be upgraded so that it works?

 
