Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Setting up my first DLT Pipeline with 3rd party JSON data

thains
New Contributor III

I'm getting an error when I try to create a DLT Pipeline from a bunch of third-party app-usage data we have. Here's the error message:

Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5 and change the column mapping mode to 'name' mapping. You can use the following command:

ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5')

So, I added the properties to my table definition, and I'm still getting the error. What am I doing wrong? Here's the table definition:

CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
 COMMENT "App usage data from CleverTap"
 TBLPROPERTIES ("myCustomPipeline.quality" = "bronze",
  "delta.columnMapping.mode" = "name",
  "delta.minReaderVersion" = "2",
  "delta.minWriterVersion" = "5")
AS
SELECT
 *
FROM
 cloud_files(
  -- REPLACE THE BELOW LINE WITH THE EXACT S3 LOCATION WHERE YOUR DATA LIVES
  "s3://clevertap-analytics/",
  "json",
  -- CHANGE THE FOLLOWING TO "false" IF THE CSV FILE(s) DO NOT INCLUDE A HEADER
  map(
   "header", "true",
   "cloudFiles.inferColumnTypes", "true",
   "cloudFiles.schemaEvolutionMode", "rescue",
   "rescuedDataColumn", "rescue_col"
  )
 );

6 REPLIES

Debayan
Databricks Employee

Hi, Could you please confirm if you have also upgraded the Delta table as mentioned?

thains
New Contributor III

I added that version to my table definition, yes. Did I do it right? My table definition is in the OP.

jose_gonzalez
Databricks Employee

You might need to do a full refresh if these changes do not work.

thains
New Contributor III

That did not help, sadly. However, I think I've identified the actual issue... See my comment from Feb 3rd.

thains
New Contributor III

It appears the problem is that the JSON files have keys with spaces in the names, like this:

"CT App Version":"3.5.6.6"

I've checked and that is a valid JSON key, even if it's unusual. Unfortunately, these files are generated by a third party, so I don't have much control over the content.
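For what it's worth, plain Python's json module accepts those spaced keys without complaint, and the rename suggested in the linked threads is just a key-by-key substitution. A minimal sketch outside any pipeline ("CT App Version" is from the real files; the second key is invented for illustration):

```python
import json

# One record with spaced keys, similar to what CleverTap emits.
raw = '{"CT App Version": "3.5.6.6", "Session Length": 42}'

record = json.loads(raw)  # parses fine: spaces in keys are legal JSON

# The same transform the PySpark snippet applies per column:
# replace spaces with underscores so Delta accepts the names.
cleaned = {key.replace(" ", "_"): value for key, value in record.items()}

print(cleaned)  # {'CT_App_Version': '3.5.6.6', 'Session_Length': 42}
```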

It looks like there might be a solution if I use Python for the Auto Loader, as I think I need to do something like this:

df = dlt.read_stream("vw_raw")
df.select(*[col(c).alias(c.replace(" ", "_")) for c in df.columns])

(from https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-colu...)

However, I am a DB guy, not a Python guy. Is there something equivalent available for the SQL version of the loader?

thains
New Contributor III

I found this other forum thread that looks potentially useful, but I can't figure out either how to translate it to SQL to handle JSON, or how to get the pipeline I'm working with to interpret the Python. When I switch to Python, it complains about the very line it inserts telling it that the script is Python!

https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-colu...

Still looking for ideas!
