12-27-2022 08:36 AM
I'm getting an error when I try to create a DLT Pipeline from a bunch of third-party app-usage data we have. Here's the error message:
Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5
and change the column mapping mode to 'name' mapping. You can use the following command:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
So, I added the properties to my table definition, and I'm still getting the error. What am I doing wrong? Here's the table definition:
CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
COMMENT "App usage data from CleverTap"
TBLPROPERTIES ("myCustomPipeline.quality" = "bronze",
"delta.columnMapping.mode" = "name",
"delta.minReaderVersion" = "2",
"delta.minWriterVersion" = "5"
)
AS
SELECT
*
FROM
cloud_files(
-- REPLACE THE BELOW LINE WITH THE EXACT S3 LOCATION WHERE YOUR DATA LIVES
"s3://clevertap-analytics/",
"json",
-- CHANGE THE FOLLOWING TO "false" IF THE CSV FILE(s) DO NOT INCLUDE A HEADER
map(
"header", "true",
"cloudFiles.inferColumnTypes", "true",
"cloudFiles.schemaEvolutionMode", "rescue",
"rescuedDataColumn", "rescue_col"
)
);
01-02-2023 10:54 AM
Hi, could you please confirm whether you have also upgraded the Delta table as mentioned?
01-03-2023 07:08 AM
I added that version to my table definition, yes. Did I do it right? My table definition is in the OP.
01-30-2023 04:13 PM
You might need to do a full refresh if these changes do not work.
02-07-2023 02:25 PM
That did not help, sadly. However, I think I've identified the actual issue... See my comment from Feb 3rd.
02-03-2023 06:20 AM
It appears the problem is that the json files have keys with spaces in the names, like this:
"CT App Version":"3.5.6.6"
I've checked, and that is a valid JSON key, even if it's unusual. Unfortunately, these files are generated by a third party, so I don't have a lot of control over the content.
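For what it's worth, here's the quick check I did in plain Python (any standard Python 3 environment): a key with spaces parses fine; the space only becomes a problem once Spark turns it into a column name.
import json

# Keys with spaces are legal JSON; json.loads accepts this without complaint.
record = json.loads('{"CT App Version": "3.5.6.6"}')
print(record)  # {'CT App Version': '3.5.6.6'}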
It looks like there might be a solution if I use python for the auto-loader, as I think I need to do something like this:
df = dlt.read_stream("vw_raw")
df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])
However, I am a DB guy, not a python guy. Is there something equivalent available for the SQL version of the loader?
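Just to show what I mean, here is roughly what I think the full Python version of the pipeline would look like (untested sketch; the bucket path, table name, and properties are carried over from my SQL definition above, and vw_raw is just the intermediate view name from that snippet):
import dlt
from pyspark.sql.functions import col

@dlt.view(comment="Raw CleverTap JSON as ingested by Auto Loader")
def vw_raw():
    # Same Auto Loader options as in the SQL cloud_files() call above.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .option("rescuedDataColumn", "rescue_col")
        .load("s3://clevertap-analytics/")
    )

@dlt.table(
    name="clevertap_analytics_bronze",
    comment="App usage data from CleverTap",
    table_properties={"myCustomPipeline.quality": "bronze"},
)
def clevertap_analytics_bronze():
    df = dlt.read_stream("vw_raw")
    # Alias every top-level column, replacing spaces with underscores so the
    # invalid-character / column-mapping error no longer applies.
    return df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])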
02-13-2023 01:14 PM
I found this other forum thread that looks potentially useful, but I can't figure out how to translate it to SQL to handle JSON, or how to get the pipeline I'm working with to interpret the Python. When I switch to Python, it complains about the line it inserts telling it that the script is Python!
Still looking for ideas!