Setting up my first DLT Pipeline with 3rd party JSON data

thains
New Contributor III

I'm getting an error when I try to create a DLT Pipeline from a bunch of third-party app-usage data we have. Here's the error message:

Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.

Please upgrade your Delta table to reader version 2 and writer version 5 and change the column mapping mode to 'name' mapping. You can use the following command:

ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5')

So, I added the properties to my table definition, and I'm still getting the error. What am I doing wrong? Here's the table definition:

CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
 COMMENT "App usage data from CleverTap"
 TBLPROPERTIES (
  "myCustomPipeline.quality" = "bronze",
  "delta.columnMapping.mode" = "name",
  "delta.minReaderVersion" = "2",
  "delta.minWriterVersion" = "5"
 )
AS
SELECT
 *
FROM
 cloud_files(
  -- REPLACE THE BELOW LINE WITH THE EXACT S3 LOCATION WHERE YOUR DATA LIVES
  "s3://clevertap-analytics/",
  "json",
  -- CHANGE THE FOLLOWING TO "false" IF THE CSV FILE(s) DO NOT INCLUDE A HEADER
  map(
   "header", "true",
   "cloudFiles.inferColumnTypes", "true",
   "cloudFiles.schemaEvolutionMode", "rescue",
   "rescuedDataColumn", "rescue_col"
  )
 );

6 REPLIES

Debayan
Esteemed Contributor III

Hi, could you please confirm whether you have also upgraded the Delta table as mentioned?

thains
New Contributor III

I added that version to my table definition, yes. Did I do it right? My table definition is in the OP.

You might need to do a full refresh if these changes do not work.

thains
New Contributor III

That did not help, sadly. However, I think I've identified the actual issue... See my comment from Feb 3rd.

thains
New Contributor III

It appears the problem is that the JSON files have keys with spaces in the names, like this:

"CT App Version":"3.5.6.6"

I've checked, and that is a valid JSON key (any string can be a key), even though it's unusual. Unfortunately, these files are generated by a third party, so I don't have a lot of control over the content.

It looks like there might be a solution if I use Python for the Auto Loader, as I think I need to do something like this:

df = dlt.read_stream("vw_raw")
df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])

(from https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-colu...)

However, I am a DB guy, not a Python guy. Is there something equivalent available for the SQL version of the loader?
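
A minimal sketch of what a SQL-side equivalent might look like, assuming the space-containing key names are known up front (the view name vw_raw and the single rename are illustrative, and * EXCEPT requires a recent Databricks Runtime):

-- Sketch only: land the raw JSON in a view (not materialized to Delta),
-- then alias each space-containing key by hand in the table definition.
CREATE TEMPORARY STREAMING LIVE VIEW vw_raw
AS SELECT * FROM cloud_files("s3://clevertap-analytics/", "json",
  map("cloudFiles.inferColumnTypes", "true"));

CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
AS SELECT
  `CT App Version` AS CT_App_Version,
  * EXCEPT (`CT App Version`)
FROM STREAM(LIVE.vw_raw);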

thains
New Contributor III

I found this other forum thread that looks potentially useful, but I can't figure out how to translate it to SQL to handle JSON, or how to get the pipeline I'm working with to interpret the Python. When I switch to Python, the pipeline complains about the very line it inserts to tell it that the script is Python!

https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-colu...

Still looking for ideas!
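
In the meantime, here is a minimal, untested sketch of what the full Python version might look like, assuming the pipeline source file is actually set to Python (the table name, S3 path, and Auto Loader options mirror the SQL definition above; everything else is an assumption):

import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="clevertap_analytics_bronze",
    comment="App usage data from CleverTap",
    table_properties={"myCustomPipeline.quality": "bronze"},
)
def clevertap_analytics_bronze():
    # Read the raw JSON with Auto Loader, mirroring the cloud_files() call above
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .option("rescuedDataColumn", "rescue_col")
        .load("s3://clevertap-analytics/")
    )
    # Replace spaces in every column name so the Delta schema check passes
    return df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])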
