<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data processing and validation with similar set of schema using databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</link>
    <description>&lt;P&gt;We need to process hundreds of .txt files in roughly 15 different formats, depending on file arrival. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What’s the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?&lt;/P&gt;</description>
    <pubDate>Tue, 08 Jul 2025 13:49:12 GMT</pubDate>
    <dc:creator>Shivap</dc:creator>
    <dc:date>2025-07-08T13:49:12Z</dc:date>
    <item>
      <title>Data processing and validation with similar set of schema using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</link>
      <description>&lt;P&gt;We need to process hundreds of .txt files in roughly 15 different formats, depending on file arrival. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What’s the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?&lt;/P&gt;</description>
      <pubDate>Tue, 08 Jul 2025 13:49:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</guid>
      <dc:creator>Shivap</dc:creator>
      <dc:date>2025-07-08T13:49:12Z</dc:date>
    </item>
    <item>
      <title>Re: Data processing and validation with similar set of schema using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124537#M47222</link>
      <description>&lt;P&gt;Hey there,&lt;/P&gt;&lt;P&gt;Processing many .txt files with different formats and validations is something Databricks handles well. Here’s a simple approach:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Recommended approach:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use DLT (Delta Live Tables) or Lakeflow to build a pipeline per format (if each format maps to a specific schema/landing table).&lt;/LI&gt;&lt;LI&gt;Add file-validation logic (such as header/trailer checks) to your DLT pipeline using Python or SQL transforms.&lt;/LI&gt;&lt;LI&gt;Use Auto Loader to watch the input folder and trigger processing as files arrive; it handles schema inference, retries, and scale very well.&lt;/LI&gt;&lt;LI&gt;After loading each file into its landing table, update the status in an external tool (via a database write or API call) using foreachBatch or a simple post-write webhook/script.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Tips:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Keep format-specific logic modular (one notebook or DLT flow per format).&lt;/LI&gt;&lt;LI&gt;Track processed files with a checkpoint or metadata table to avoid duplicates.&lt;/LI&gt;&lt;LI&gt;Lakeflow is great if you prefer a visual, low-code pipeline builder, especially when teams are collaborating.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Jul 2025 06:28:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124537#M47222</guid>
      <dc:creator>intuz</dc:creator>
      <dc:date>2025-07-09T06:28:55Z</dc:date>
    </item>
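    <!-- Editor's note: the header/trailer validation described in the reply can be sketched in plain Python before files are handed to an Auto Loader/DLT pipeline. This is a minimal sketch under an assumed, hypothetical file layout: the first line starts with HDR, the last line starts with TRL, and the trailer carries the expected record count after a pipe (e.g. TRL|2). Real formats will differ per landing table. -->

```python
def validate_file(lines, header_prefix="HDR", trailer_prefix="TRL"):
    """Basic header/trailer validation for a fixed-format text file.

    Assumes a hypothetical layout: the first line starts with
    ``header_prefix``, the last line starts with ``trailer_prefix`` and
    carries the expected record count after a pipe, e.g. ``TRL|2``.
    Returns True only when both markers are present and the count
    matches the number of data records in between.
    """
    if len(lines) < 2:
        return False
    if not lines[0].startswith(header_prefix):
        return False
    if not lines[-1].startswith(trailer_prefix):
        return False
    try:
        expected = int(lines[-1].split("|")[1])
    except (IndexError, ValueError):
        # Trailer present but count missing or non-numeric.
        return False
    # Data records sit between the header and the trailer.
    return expected == len(lines) - 2
```

    <!-- In a pipeline, a check like this could route failing files to a quarantine path before Auto Loader ingests the rest, or be expressed instead as DLT expectations on the loaded rows. -->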
  </channel>
</rss>

