Delta Live Table autoloader's inferColumnTypes does not work

kwinsor5
New Contributor II

I am experimenting with DLTs/Autoloader. I have a simple, flat JSON file that I am attempting to load into a DLT (following this guide) like so: 

 

CREATE OR REFRESH STREAMING LIVE TABLE statistics_live
COMMENT "The raw statistics data"
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/mnt/raw/statistics/", "json", map("cloudFiles.inferColumnTypes", "true"));

 

The error message I am getting is: com.databricks.sql.cloudfiles.errors.CloudFilesAnalysisException: Failed to infer schema for format json from existing files in input path /mnt/raw/statistics/. Please ensure you configured the options properly or explicitly specify the schema.

 

The JSON file looks like this: 

 

[
  {
    "pass": 26,
    "rush": 5,
    "total_return": 1,
    "total": 32,
    "fumble_return": 0,
    "int_return": 1,
    "kick_return": 0,
    "punt_return": 0,
    "other": 0
  }
]
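A likely cause, though the thread never confirms it: this sample is a pretty-printed JSON array, and Spark's JSON source defaults to newline-delimited JSON (one record per line), so schema inference finds no parseable records in the partial file. A quick batch-read sanity check, using the mount path from the question:

# Read the same path as a plain batch DataFrame to inspect what Spark sees.

# Default mode expects one JSON object per line; a pretty-printed array
# typically surfaces as a single _corrupt_record column.
spark.read.json("/mnt/raw/statistics/").printSchema()

# With multiLine enabled, the array parses and column types are inferred.
spark.read.option("multiLine", "true").json("/mnt/raw/statistics/").printSchema()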

 

 

I've seen a lot of "answers" out there saying to just specify the schema, but if I expect my schema to change over time, that is not an option.
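If the concern is freezing the whole schema, cloudFiles.schemaHints offers a middle ground: pin types only for the columns you name and let Auto Loader infer, and evolve, the rest. A minimal sketch, where both paths are assumptions rather than values from the thread:

# Sketch only: the schemaLocation path below is a placeholder.
df = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("multiLine", "true")
  .option("cloudFiles.schemaLocation", "/mnt/schemas/statistics")  # assumed path
  .option("cloudFiles.schemaHints", "pass INT, total INT")  # only these columns are pinned
  .load("/mnt/raw/statistics/")
)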


2 REPLIES

kwinsor5
New Contributor II

Interestingly enough, I moved on to generating the full JSON file and storing it in our cloud storage rather than working with a partial file. The fully generated file was inferred correctly when I triggered the autoloader pipeline, complex child JSON properties and all. I guess I'll leave the question up though because I have no clue why the partial file was throwing exceptions at me.

pavlos_skev
New Contributor III

I had the same issue with a similar JSON structure to yours. Adding the option "multiLine" set to "true" fixed it for me:

# schemaLocation and landingZoneLocation are paths defined elsewhere in the notebook
df = (spark.readStream.format("cloudFiles")
  .option("multiLine", "true")  # the fix: parse JSON records that span multiple lines
  .option("cloudFiles.schemaLocation", schemaLocation)  # where Auto Loader persists the inferred schema
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")  # infer real types instead of defaulting to strings
  .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # pick up new columns as they appear
  .load(landingZoneLocation)
)
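For the SQL pipeline in the original question, the same reader option should pass through the cloud_files options map as well, e.g. map("multiLine", "true", "cloudFiles.inferColumnTypes", "true").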
