Delta Live Table autoloader's inferColumnTypes does not work
08-03-2023 11:39 AM - edited 08-03-2023 01:18 PM
I am experimenting with Delta Live Tables (DLT) and Auto Loader. I have a simple, flat JSON file that I am trying to load into a DLT table (following this guide) like so:
CREATE OR REFRESH STREAMING LIVE TABLE statistics_live
COMMENT "The raw statistics data"
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/mnt/raw/statistics/", "json", map("cloudFiles.inferColumnTypes", "true"));
The error message I am getting is:
com.databricks.sql.cloudfiles.errors.CloudFilesAnalysisException: Failed to infer schema for format json from existing files in input path /mnt/raw/statistics/. Please ensure you configured the options properly or explicitly specify the schema.
The JSON file looks like this:
[
  {
    "pass": 26,
    "rush": 5,
    "total_return": 1,
    "total": 32,
    "fumble_return": 0,
    "int_return": 1,
    "kick_return": 0,
    "punt_return": 0,
    "other": 0
  }
]
I've seen a lot of "answers" out there saying to just specify the schema, but I expect my schema to change over time, so that is not an option.
EDIT: Interestingly enough, I moved on to generating the full JSON file and storing it in our cloud storage rather than working with a partial file. The fully generated file was inferred correctly when I triggered the Auto Loader pipeline, complex child JSON properties and all. I guess I'll leave the question up, though, because I have no clue why the partial file was throwing exceptions at me.
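One possible explanation, sketched under assumptions the thread does not confirm: Spark's JSON reader treats every line of a file as a separate record unless told otherwise, so a pretty-printed file like the one above contains no line that is a complete JSON document, while a file written on a single line does. The plain-Python sketch below only mimics that line-by-line behavior with the standard `json` module; it is not Spark's parser. The helper names `parse_line_delimited` and `parse_whole_document` are hypothetical, and whether the fully generated file was actually written on one line is an assumption.

```python
import json

# A shortened copy of the sample from the question, pretty-printed over several lines.
PRETTY = """[
  {
    "pass": 26,
    "rush": 5
  }
]"""

# The same data serialized on a single line (assumed shape of the "full" file).
COMPACT = json.dumps(json.loads(PRETTY))


def parse_line_delimited(text):
    """Treat each non-empty line as its own JSON document (JSON Lines style).

    Lines that are not valid JSON on their own are skipped, which is a rough
    stand-in for records a line-oriented reader cannot use.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            continue  # this line is only a fragment of a larger document
        if isinstance(value, dict):
            records.append(value)
        elif isinstance(value, list):
            records.extend(v for v in value if isinstance(v, dict))
    return records


def parse_whole_document(text):
    """Parse the whole text as one JSON document, as a multi-line mode would."""
    value = json.loads(text)
    return value if isinstance(value, list) else [value]


if __name__ == "__main__":
    print(len(parse_line_delimited(PRETTY)))   # no usable records line by line
    print(len(parse_whole_document(PRETTY)))   # one record when read as a whole
    print(len(parse_line_delimited(COMPACT)))  # one record on a single line
```

If no line yields a record, there is nothing to infer a schema from, which would be consistent with the "Failed to infer schema" error.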
07-25-2024 02:03 AM - edited 07-25-2024 02:20 AM
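I had the same issue with a JSON structure similar to yours. Setting the "multiLine" option to true fixed it for me. By default the JSON reader expects one record per line, so a file whose records span multiple lines needs this option.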
# schemaLocation and landingZoneLocation are path strings defined elsewhere.
df = (
    spark.readStream.format("cloudFiles")
    # Parse each file as a whole instead of one JSON record per line.
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", schemaLocation)
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(landingZoneLocation)
)
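To reuse the same settings across several streams, the options above can be collected in one place. This is a minimal sketch, not an official pattern: the helper name `autoloader_json_options` and the example path `/mnt/schemas/statistics` are hypothetical, and only the option keys come from the snippet above.

```python
def autoloader_json_options(schema_location, multi_line=True):
    """Build the Auto Loader options used above as a plain dict.

    schema_location: path where Auto Loader keeps the inferred schema.
    multi_line: set to True when a JSON record spans several lines.
    """
    options = {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }
    if multi_line:
        options["multiLine"] = "true"
    return options


if __name__ == "__main__":
    # Hypothetical schema path, for illustration only.
    opts = autoloader_json_options("/mnt/schemas/statistics")
    for key, value in sorted(opts.items()):
        print(f"{key} = {value}")
    # On a cluster, the dict could be passed to the reader in one call:
    # df = spark.readStream.format("cloudFiles").options(**opts).load(landingZoneLocation)
```

For the SQL form in the question, adding "multiLine", "true" to the map(...) passed to cloud_files would presumably have the same effect, though that is not tested in this thread.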