Delta Live Table autoloader's inferColumnTypes does not work

kwinsor5 · 08-03-2023

I am experimenting with Delta Live Tables (DLT) and Auto Loader. I have a simple, flat JSON file that I am trying to load into a DLT table (following this guide) like so:

CREATE OR REFRESH STREAMING LIVE TABLE statistics_live
COMMENT "The raw statistics data"
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/mnt/raw/statistics/", "json", map("cloudFiles.inferColumnTypes", "true"));

The error message I am getting is: com.databricks.sql.cloudfiles.errors.CloudFilesAnalysisException: Failed to infer schema for format json from existing files in input path /mnt/raw/statistics/. Please ensure you configured the options properly or explicitly specify the schema.

The JSON file looks like this:

[
  {
    "pass": 26,
    "rush": 5,
    "total_return": 1,
    "total": 32,
    "fumble_return": 0,
    "int_return": 1,
    "kick_return": 0,
    "punt_return": 0,
    "other": 0
  }
]

I've seen a lot of "answers" out there saying to just specify the schema, but if I expect my schema to change over time, that is not an option.

EDIT: Interestingly enough, I moved on to generating the full JSON file and storing it in our cloud storage rather than working with a partial file. The fully generated file was inferred correctly when I triggered the autoloader pipeline, complex child JSON properties and all. I guess I'll leave the question up though because I have no clue why the partial file was throwing exceptions at me.

pavlos_skev · 07-25-2024

I had the same issue with a JSON structure similar to yours. Setting the "multiLine" option to true fixed it for me:

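# schemaLocation and landingZoneLocation are path variables defined elsewhere.
df = (spark.readStream.format("cloudFiles")
  # Parse each file as one JSON document instead of one record per line.
  .option("multiLine", "true")
  # Where Auto Loader stores the inferred schema.
  .option("cloudFiles.schemaLocation", schemaLocation)
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  # Add columns that appear in later files to the schema.
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
  .load(landingZoneLocation)
)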
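
A likely explanation for why "multiLine" helps, though nobody in this thread confirmed it for the original file: by default Spark's JSON reader expects one complete JSON record per line (JSON Lines), and the sample file is a single array pretty-printed across many lines. The plain-Python sketch below only mimics the two parsing modes to show the difference. It is not Spark's implementation, and the function names are hypothetical.

```python
import json

# The sample file from the question: one JSON array spread over several lines.
PRETTY_PRINTED = """[
  {
    "pass": 26,
    "rush": 5,
    "total_return": 1,
    "total": 32,
    "fumble_return": 0,
    "int_return": 1,
    "kick_return": 0,
    "punt_return": 0,
    "other": 0
  }
]"""


def parse_line_by_line(text):
    """Treat every non-empty line as its own JSON record (JSON Lines style).

    Returns (records, bad_lines): the objects that parsed, and the lines that did not.
    """
    records, bad_lines = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            bad_lines.append(line)
            continue
        if isinstance(value, dict):
            records.append(value)
        else:
            bad_lines.append(line)
    return records, bad_lines


def parse_whole_document(text):
    """Treat the whole text as one JSON document (the idea behind multiLine)."""
    value = json.loads(text)
    return value if isinstance(value, list) else [value]


def to_json_lines(records):
    """Serialize records as JSON Lines: one compact object per line."""
    return "\n".join(json.dumps(record) for record in records)


if __name__ == "__main__":
    records, bad_lines = parse_line_by_line(PRETTY_PRINTED)
    print(f"line by line: {len(records)} records, {len(bad_lines)} unparseable lines")

    whole = parse_whole_document(PRETTY_PRINTED)
    print(f"whole document: {len(whole)} record(s), columns = {sorted(whole[0])}")

    # Rewritten as JSON Lines, the same data parses line by line without errors.
    records, bad_lines = parse_line_by_line(to_json_lines(whole))
    print(f"as JSON Lines: {len(records)} records, {len(bad_lines)} unparseable lines")
```

If that is what happened here, there are two ways out: set "multiLine" to true as above, or write the source files as JSON Lines so the default line-based parsing can find a record on each line.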
