topic Re: Setting up my first DLT Pipeline with 3rd party JSON data in Data Engineering

Setting up my first DLT Pipeline with 3rd party JSON data

thains — Tue, 27 Dec 2022 16:36:49 GMT

I'm getting an error when I try to create a DLT Pipeline from a bunch of third-party app-usage data we have. Here's the error message:

Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5
and change the column mapping mode to 'name' mapping. You can use the following command:

ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5')
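
For anyone hitting the same message: the check is on the column names themselves, so it can help to list which names contain any of the characters the error quotes. The following is a minimal sketch in plain Python, not anything DLT provides; the helper name `find_invalid_columns` and the sample column names are hypothetical.

```python
# Characters quoted in the error message: space , ; { } ( ) newline tab =
INVALID_CHARS = set(" ,;{}()\n\t=")


def find_invalid_columns(column_names):
    """Map each offending column name to the sorted invalid characters it contains."""
    problems = {}
    for name in column_names:
        bad = sorted(set(name) & INVALID_CHARS)
        if bad:
            problems[name] = bad
    return problems


# Hypothetical column names, for illustration only.
sample_columns = ["event_ts", "App Version", "event(name)"]
for name, bad in find_invalid_columns(sample_columns).items():
    print(repr(name), "contains", bad)
```

Running it against the top-level keys of one sample file should show quickly whether the schema itself is what triggers the error.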

So, I added the properties to my table definition, and I'm still getting the error. What am I doing wrong? Here's the table definition:

CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
COMMENT "App usage data from CleverTap"
TBLPROPERTIES (
  "myCustomPipeline.quality" = "bronze",
  "delta.columnMapping.mode" = "name",
  "delta.minReaderVersion" = "2",
  "delta.minWriterVersion" = "5"
)
SELECT *
FROM cloud_files(
  -- REPLACE THE BELOW LINE WITH THE EXACT S3 LOCATION WHERE YOUR DATA LIVES
  "s3://clevertap-analytics/",
  "json",
  -- CHANGE THE FOLLOWING TO "false" IF THE CSV FILE(s) DO NOT INCLUDE A HEADER
  map(
    "header", "true",
    "cloudFiles.inferColumnTypes", "true",
    "cloudFiles.schemaEvolutionMode", "rescue",
    "rescuedDataColumn", "rescue_col"
  )
);

Re: Setting up my first DLT Pipeline with 3rd party JSON data

thains — Tue, 03 Jan 2023 15:08:54 GMT

I added that version to my table definition, yes. Did I do it right? My table definition is in the OP.

Re: Setting up my first DLT Pipeline with 3rd party JSON data

jose_gonzalez — Tue, 31 Jan 2023 00:13:55 GMT

You might need to do a full refresh if these changes do not work.

Re: Setting up my first DLT Pipeline with 3rd party JSON data

thains — Fri, 03 Feb 2023 14:20:59 GMT

It appears the problem is that the json files have keys with spaces in the names, like this:

"CT App Version":"3.5.6.6"

I've checked, and that is apparently a valid JSON key, even though it's unconventional. Unfortunately, these files are generated by a third party, so I don't have much control over the content.
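
A quick way to confirm this is to parse such a record with a standard JSON parser. The sketch below uses only Python's built-in `json` module and the key/value pair quoted above; JSON object keys are ordinary strings, so the space is accepted.

```python
import json

# The key/value pair from the file, wrapped in an object so it parses on its own.
record = json.loads('{"CT App Version": "3.5.6.6"}')

# The key keeps its spaces; the problem only appears when it becomes a column name.
print(list(record.keys()))
print(record["CT App Version"])
```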

It looks like there might be a solution if I use Python for the Auto Loader; I think I need to do something like this:

select([col(c).alias(c.replace(" ", "_")) for c in dlt.readStream("vw_raw").columns])

(from https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-column-names-using-delta-live-into-a-streaming-table?t=1675275633543)

However, I am a DB guy, not a Python guy. Is there something equivalent available for the SQL version of the loader?
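
One possible direction, sketched here under assumptions rather than confirmed in this thread: Spark SQL lets you quote an identifier with backticks and alias it, so a select list such as `` `CT App Version` AS CT_App_Version `` expresses the same rename as the Python snippet. Whether the DLT SQL loader accepts such a list over `cloud_files(...)` has not been verified here. The helper below is plain Python that only generates the select list text from known column names; the names `sanitize` and `select_list` are hypothetical, and nested fields and name collisions are not handled.

```python
import re

# Characters quoted in the error message: space , ; { } ( ) newline tab =
_INVALID = re.compile(r"[ ,;{}()\n\t=]")


def sanitize(name):
    """Replace every invalid character in a column name with an underscore."""
    return _INVALID.sub("_", name)


def select_list(column_names):
    """Build a SQL select list that aliases each column to its sanitized name."""
    parts = []
    for name in column_names:
        # Backtick-quote the original name; a literal backtick is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        clean = sanitize(name)
        parts.append(quoted if clean == name else f"{quoted} AS {clean}")
    return ",\n  ".join(parts)


# "CT App Version" is from the thread; "ts" is a hypothetical second column.
print("SELECT\n  " + select_list(["CT App Version", "ts"]))
```

The generated text would replace the `SELECT *` in the table definition. The drawback compared with the Python list comprehension is that the column names must be known in advance, so new keys appearing in the source files would need the list regenerated.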

Re: Setting up my first DLT Pipeline with 3rd party JSON data

thains — Tue, 07 Feb 2023 22:25:55 GMT

That did not help, sadly. However, I think I've identified the actual issue... See my comment from Feb 3rd.

Re: Setting up my first DLT Pipeline with 3rd party JSON data

thains — Mon, 13 Feb 2023 21:14:53 GMT

I found this other forum thread that looks potentially useful, but I can't figure out either how to translate it to SQL to handle JSON or how to get the pipeline I'm working with to interpret the Python. When I switch to Python, it complains about the line it inserts declaring that the script is Python!

https://community.databricks.com/s/question/0D58Y000092eaqcSAA/ingest-a-csv-file-with-spaces-in-column-names-using-delta-live-into-a-streaming-table

Still looking for ideas!

Re: Setting up my first DLT Pipeline with 3rd party JSON data

Debayan — Mon, 02 Jan 2023 18:54:43 GMT

Hi, could you please confirm whether you have also upgraded the Delta table as mentioned?