cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Autolodaer schemaHints convert valid values to null

_databreaks
New Contributor

I am ingesting json files from S3 using Autoloader and would like to use schemaHints to define the datatype of one of the fields, that is, I wanted the field id to be of integer type.

The DLT code below infers the the id as string, with correct values.

 

create streaming table raw as
select * 
from cloud_files('/path', 'json',
  map("cloudFiles.inferColumnTypes", "true")
  );

 

But I would like the id column to be integer type, hence wanted to use schemaHints like below :

 

create streaming table raw as
select * 
from cloud_files('/path', 'json',
  map("cloudFiles.inferColumnTypes", "true",
      "cloudFiles.schemaHints", "id LONG"
  )
)

 

The resulting table will have the correct data type but all the values of the column/field id are now null.

I've done some checking on the data such as isnotnull(int(id)) to see if there is any record in not convertible to int but surprise surprise, all ids are castable to int.

I already have a work-around in mind. I just want to understand why would autoloader do that?

Is this a known issue/bug?

Regards.

 

1 REPLY 1

Kaniz
Community Manager
Community Manager

Hi @_databreaksWhen using Auto Loader to ingest JSON files from S3, itโ€™s essential to configure schema inference and evolution correctly.

Letโ€™s dive into the details:

  1. Schema Inference and Evolution:

    • Auto Loader can automatically detect the schema of loaded data, allowing you to initialize tables without explicitly declaring the data schema. It also evolves the table schema as new columns are introduced, eliminating the need for manual schema tracking and updates.
    • Supported formats for schema inference and evolution include JSON, CSV, Avro, XML, Parquet, and ORC.
    • To enable schema inference and evolution, specify a target directory for the cloudFiles.schemaLocation option. This directory keeps track of your data schema over time.
    • When reading data for the first time, Auto Loader samples the first 50 GB or 1000 files (whichever limit is crossed first) to infer the schema. The inferred schema information is stored in the _schemas directory at the specified cloudFiles.schemaLocation.
  2. Schema Hints:

    • If you know that a column should be of a specific data type or want to choose a more general data type (e.g., using a double instead of an integer), you can provide hints for column data types.
    • Use SQL schema specification syntax to provide hints. For example, to set the id column as a LONG, you can use:
      "cloudFiles.schemaHints", "id LONG"
      
  3. Issue with Null Values:

    • You mentioned that after using schema hints, the resulting table has the correct data type for the id column, but all values are now null.
    • This behaviour might occur due to conflicting or overriding settings in your code or configuration.
    • Ensure that you explicitly set the id field as a string using the appropriate hint.
    • Also, check if there are any other factors causing the id field to be interpreted as an integer.
Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.