12-27-2022 08:36 AM
I'm getting an error when I try to create a DLT Pipeline from a bunch of third-party app-usage data we have. Here's the error message:
Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema.
Please upgrade your Delta table to reader version 2 and writer version 5
and change the column mapping mode to 'name' mapping. You can use the following command:
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
So, I added the properties to my table definition, and I'm still getting the error. What am I doing wrong? Here's the table definition:
CREATE STREAMING LIVE TABLE clevertap_analytics_bronze
COMMENT "App usage data from CleverTap"
TBLPROPERTIES ("myCustomPipeline.quality" = "bronze",
"delta.columnMapping.mode" = "name",
"delta.minReaderVersion" = "2",
"delta.minWriterVersion" = "5"
)
AS
SELECT
*
FROM
cloud_files(
-- REPLACE THE BELOW LINE WITH THE EXACT S3 LOCATION WHERE YOUR DATA LIVES
"s3://clevertap-analytics/",
"json",
-- NOTE: "header" ONLY APPLIES TO CSV SOURCES; THIS SOURCE READS JSON
map(
"header", "true",
"cloudFiles.inferColumnTypes", "true",
"cloudFiles.schemaEvolutionMode", "rescue",
"rescuedDataColumn", "rescue_col"
)
);
01-02-2023 10:54 AM
Hi, could you please confirm whether you have also upgraded the existing Delta table as mentioned in the error message?
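That is, running the command from the error message directly against the table that already exists, rather than only setting the properties in the CREATE statement. A sketch, assuming the pipeline publishes to a table with the same name as your definition (adjust the database/schema to wherever the pipeline actually writes):

-- Command taken from the error message, applied to the existing target table
ALTER TABLE clevertap_analytics_bronze SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
);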
01-03-2023 07:08 AM
I added that version to my table definition, yes. Did I do it right? My table definition is in the OP.
01-30-2023 04:13 PM
You might need to do a full refresh of the pipeline if these changes do not work.
02-07-2023 02:25 PM
That did not help, sadly. However, I think I've identified the actual issue... See my comment from Feb 3rd.
02-03-2023 06:20 AM
It appears the problem is that the JSON files have keys with spaces in the names, like this:
"CT App Version":"3.5.6.6"
I've checked, and that is actually a valid JSON key (the spec allows any string), even if it's unusual. Unfortunately, these files are generated by a third party, so I don't have much control over the content.
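For what it's worth, I can query the raw files directly and the key does come through as a column name with a space in it. Backticks are just how Spark SQL quotes such identifiers, and the path is the same bucket as in my table definition:

SELECT `CT App Version`
FROM json.`s3://clevertap-analytics/`
LIMIT 5;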
It looks like there might be a solution if I use Python for the Auto Loader, as I think I need to do something like this (col comes from pyspark.sql.functions):
df = dlt.read_stream("vw_raw")
df.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])
However, I am a DB guy, not a Python guy. Is there something equivalent available for the SQL version of the loader?
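The only SQL workaround I can come up with is aliasing every affected column by hand, which doesn't scale since the vendor adds fields over time. A sketch of what I mean, assuming a raw view called vw_raw (the same name as in the Python snippet) and using one real column from my data:

CREATE STREAMING LIVE TABLE clevertap_renamed
AS
SELECT
  `CT App Version` AS CT_App_Version
  -- ...and one line like this for every other key that contains a space
FROM STREAM(live.vw_raw);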
02-13-2023 01:14 PM
I found this other forum thread that looks potentially useful, but I can't figure out either how to translate it to SQL to handle JSON, or how to get the pipeline I'm working with to interpret the Python. When I switch the notebook to Python, the pipeline complains about the very line it inserts to tell it that the script is Python!
Still looking for ideas!