
DLT Autoloader schemaHints from JSON file instead of inline list?

Rohit_hk
Visitor

Hi @Witold @Hubert-Dudek,

I'm using a DLT pipeline to ingest real-time data from Parquet files in S3 into Delta tables with Auto Loader. The pipeline is written in SQL notebooks.

Problem:
Sometimes decimal columns in the Parquet files are inferred as INT, which breaks my downstream logic. To control this, I'm using schemaHints, and it works when I pass the column definitions inline.

Working example:

select *
from stream cloud_files(
  's3://my-bucket/path',
  'parquet',
  map('cloudFiles.schemaHints', 'id INT, sal DECIMAL(10,2)')
);

However, I don't want to hardcode the schema in the SQL. I tried keeping the schema in a JSON file and passing its path instead, something like:

select *
from stream cloud_files(
  's3://my-bucket/path',
  'parquet',
  map('cloudFiles.schemaHints', 'dbfs:/schemas/my_table_schema.json')
);

This does NOT work: Auto Loader treats the value as a literal "id INT, sal DECIMAL..." style string, not as a path.

Questions:

  • Is this the expected behaviour (schemaHints only accepts an inline string and not a file path)?

  • Is there any supported way in DLT SQL to:

    • load a schema definition from a JSON file, and

    • feed it into cloudFiles / schemaHints (or another option) without hardcoding the full "col dtype, col2 dtype..." string in the SQL?

Goal:
I want a single JSON schema file per source that multiple DLT SQL pipelines can reuse, while still preventing decimal columns from being inferred as INT.

Any suggestions or patterns (e.g., using Python to read the JSON and set pipeline configuration, schema evolution tricks, or alternative options in Auto Loader) would be really helpful.

2 REPLIES

K_Anudeep
Databricks Employee

Hello @Rohit_hk ,

Below are the answers to your questions:

  • Is this the expected behaviour (schemaHints only accepts an inline string and not a file path)?

Ans:

Yes. cloudFiles.schemaHints is defined as a plain string option, so you can only pass a DDL string, not a path. There is no support for interpreting its value as a path to a JSON file. Doc: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options

 

  • Is there any supported way in DLT SQL to:

    • load a schema definition from a JSON file, and

    • feed it into cloudFiles / schemaHints (or another option) without hardcoding the full "col dtype, col2 dtype..." string in the SQL?

Ans:

Not directly. In a pure SQL DLT (Lakeflow Declarative Pipelines) notebook:

  • You cannot read an arbitrary JSON file from DBFS/S3 and inject its contents into cloud_files or schemaHints at runtime.
  • The options map in cloud_files/read_files must be literals or parameter expansions, not the result of reading a file.

However, DLT / Lakeflow pipelines support parameters whose values are injected into your SQL as strings. Doc: https://docs.databricks.com/aws/en/ldp/parameters

You can:

  1. Put this in your pipeline configuration (UI/JSON):

 

"configuration": { "myschema.my_table_hints": "id INT, sal DECIMAL(10,2), other_col STRING" }
  1. Reference it in your SQL notebook:

 
CREATE OR REFRESH STREAMING LIVE TABLE bronze_my_table AS
SELECT *
FROM STREAM cloud_files(
  's3://my-bucket/path',
  'parquet',
  map(
    'cloudFiles.schemaHints', '${myschema.my_table_hints}',
    'cloudFiles.schemaLocation', 'dbfs:/schemas/autoloader/my_table'
  )
);
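
If you want the JSON file to stay the single source of truth, you can generate that configuration value from it outside the pipeline and paste (or script) the result into the pipeline settings. A minimal sketch, assuming the schema file is a simple JSON object of {"column_name": "SQL_TYPE"} pairs; the file name and config key below are placeholders:

# A minimal sketch: convert a simple {"column_name": "SQL_TYPE"} JSON file
# into the "col TYPE, col2 TYPE" string that cloudFiles.schemaHints expects.
import json

with open("my_table_schema.json") as f:  # placeholder path
    schema_map = json.load(f)

hints = ", ".join(f"{col} {dtype}" for col, dtype in schema_map.items())

# Paste this value into the pipeline's "configuration" block
# (or set it when you deploy the pipeline).
print(json.dumps({"myschema.my_table_hints": hints}, indent=2))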

 

Anudeep

Hubert-Dudek
Esteemed Contributor III

- DLT automatically sets cloudFiles.schemaLocation, so the inferred schema is stored for you and in many cases it will be stable, but that alone does not guarantee the types are inferred the way you want

- keep using cloudFiles.schemaHints, but load the JSON into a variable and pass that variable (you will likely need a conversion from JSON to a SQL DDL string, which can be done with a simple Python script), as in the sketch below
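
A minimal sketch of that idea in a Python DLT notebook, assuming the schema file is a simple JSON object of {"column_name": "SQL_TYPE"} pairs; the paths and table name are placeholders, and reading via /dbfs/ assumes the DBFS FUSE mount is available (use a /Volumes/ path for a Unity Catalog volume instead):

import json

import dlt

# Read the shared schema file once, when the pipeline graph is built.
with open("/dbfs/schemas/my_table_schema.json") as f:  # placeholder path
    schema_map = json.load(f)

# Build the "col TYPE, col2 TYPE" string that schemaHints expects.
schema_hints = ", ".join(f"{col} {dtype}" for col, dtype in schema_map.items())

@dlt.table(name="bronze_my_table")  # placeholder table name
def bronze_my_table():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaHints", schema_hints)
        .load("s3://my-bucket/path")
    )

Because the hints string is built from the JSON at pipeline start-up, several pipelines can point at the same schema file per source, which matches the goal in the question.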