
Bug report: the delimiter option does not work when run on DLT

rpl
New Contributor III

I have a semicolon-separated file in an ADLS container that's been added to Unity Catalog as an external location.

When I run the following code on an all-purpose cluster, it runs ok and displays the schema.

import dlt

@dlt.table
def test_data_csv():
    # Auto Loader (cloudFiles) streaming read of the semicolon-separated file
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", ";")
        .load("abfss://<container_name>@<storage_name>.dfs.core.windows.net/test_csv")
    )

The same code throws an error when run via a Delta Live Tables pipeline:

com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @rpl, the error message you encountered when running the code via a Delta Live Tables pipeline indicates that there are invalid characters in the column names of your schema. Specifically, the characters ' ,;{}()\n\t=' are causing the issue.

 

To resolve this, consider the following steps:

  • Check Column Names: inspect the column names in your semicolon-separated file and make sure they contain none of the characters listed in the error message.
  • Column Renaming: if necessary, rename the columns to remove invalid characters, for example with Spark's withColumnRenamed method (see the sketch after this list).
  • Unity Catalog External Locations: since you're working with an external location in Unity Catalog, ensure that the storage credential and external location are set up correctly. External locations allow Unity Catalog to read and write data on your cloud tenant on behalf of users; they combine a cloud storage path with a storage credential that authorizes access to that path.
  • Storage Credential: verify that the storage credential (an Azure managed identity or service principal) is configured correctly. Each storage credential is subject to Unity Catalog access-control policies, so ensure that the necessary privileges have been granted.
  • Read-Only Marking: consider marking the storage credential and external location as read-only if you want to prevent users from writing through them. This can be useful for maintaining data integrity.
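As a sketch of the renaming step (the table name and substitution pattern below are illustrative, and assume the offending characters come from the file's header row):

import re
import dlt

@dlt.table
def test_data_csv_clean():
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", ";")
        .load("abfss://<container_name>@<storage_name>.dfs.core.windows.net/test_csv")
    )
    # Replace every character Delta rejects (' ,;{}()\n\t=') with an underscore.
    for old_name in df.columns:
        new_name = re.sub(r"[ ,;{}()\n\t=]", "_", old_name)
        if new_name != old_name:
            df = df.withColumnRenamed(old_name, new_name)
    return df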

Remember that external locations can be used not only for external tables but also for managed tables and volumes. They define storage locations for both external and managed data.

If you continue to face issues, double-check the column names and the configuration of your external location and storage credentials, and consult the documentation for further guidance on managing external locations in Unity Catalog.

rpl
New Contributor III

As mentioned, the code runs just fine on an all-purpose cluster, so special characters can't be the problem.

But I checked this further by making two versions of the file:

  • a regular CSV with comma as separator and dot as decimal separator,
  • an identical "European-style" file with semicolon as separator and comma as decimal separator.

The regular CSV file works fine in a DLT pipeline, so there are no invalid characters in the column names.

If I don't use .option("delimiter", ";"), then attempting to read the European-style CSV file throws the mentioned error both when run on an all-purpose cluster and when run in a DLT pipeline (as expected, because semicolons are invalid in column names).

When I specify .option("delimiter", ";"), the code runs on an all-purpose cluster but throws the mentioned error in a DLT pipeline. To me that indicates that .option("delimiter", ";") is for some reason ignored when run via a DLT pipeline. The special characters flagged by the error message must be the semicolons, as there are no other special characters in the first row.

Can anyone reproduce this? 
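For reference, an equivalent batch read (the path is a placeholder) parses the file correctly on an all-purpose cluster, which is what makes the DLT behavior look like the option is being dropped:

# Batch read on an all-purpose cluster: the delimiter option is honored
# and the header splits into one column per field.
df = (
    spark.read
    .option("header", "true")
    .option("delimiter", ";")
    .csv("abfss://<container_name>@<storage_name>.dfs.core.windows.net/test_csv")
)
df.printSchema()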

rpl
New Contributor III

@Kaniz_Fatma can you confirm that .option("delimiter", ";") is ignored when run in a DLT pipeline? (Please see the post above.) My colleague confirmed the behavior.
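In the meantime, a possible workaround sketch, untested and assuming the failure happens during schema inference: supply an explicit schema so the column names never come from a misparsed header (the column names and types below are made up):

import dlt
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema; the real names and types depend on the file.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", DoubleType()),
])

@dlt.table
def test_data_csv_explicit_schema():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("delimiter", ";")
        .schema(schema)
        .load("abfss://<container_name>@<storage_name>.dfs.core.windows.net/test_csv")
    )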
