Hi @Witold, @Hubert-Dudek,
I'm using a DLT pipeline to ingest real-time data from Parquet files in S3 into Delta tables using Auto Loader. The pipeline is written in SQL notebooks.
Problem:
Sometimes decimal columns in the Parquet files get inferred as INT, which breaks my downstream logic. To control this I'm using schemaHints, and it works if I pass the column definitions inline.
Working example:
select *
from stream cloud_files(
  's3://my-bucket/path',
  'parquet',
  map('cloudFiles.schemaHints', 'id INT, sal DECIMAL(10,2)')
);
However, I don't want to hardcode the schema in the SQL. I tried to keep the schema in a JSON file and pass the path instead, something like:
select *
from stream cloud_files(
  's3://my-bucket/path',
  'parquet',
  map('cloudFiles.schemaHints', 'dbfs:/schemas/my_table_schema.json')
);
This does NOT work: Auto Loader treats the value as a literal 'id INT, sal DECIMAL…' style hint string, not as a path to read.
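For reference, the JSON file is just a per-source map of column names to types, along these lines (illustrative layout only):

{
  "id": "INT",
  "sal": "DECIMAL(10,2)"
}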
Goal:
I want a single JSON schema file per source that multiple DLT SQL pipelines can reuse, while still preventing decimal columns from being inferred as INT.

Question:
Any suggestions or patterns (e.g., using Python to read the JSON and set pipeline configuration, schema evolution tricks, or alternative Auto Loader options) would be really helpful.
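For example, the Python pattern I was imagining looks roughly like this. This is only a sketch: the file path, the JSON layout, and the table name are placeholders, and it would mean moving the table definition out of the SQL notebook into Python:

import json
import dlt

# Hypothetical per-source schema file, e.g. {"id": "INT", "sal": "DECIMAL(10,2)"}
SCHEMA_FILE = "/dbfs/schemas/my_table_schema.json"

with open(SCHEMA_FILE) as f:
    columns = json.load(f)

# Rebuild the same comma-separated string that schemaHints accepts inline
schema_hints = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())

@dlt.table(name="my_table")
def my_table():
    # `spark` is the ambient SparkSession provided by the pipeline runtime
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaHints", schema_hints)
        .load("s3://my-bucket/path")
    )

But if there is a way to achieve the same thing while keeping the pipelines in SQL (for example by injecting the hint string through pipeline configuration), I'd prefer that.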