
Data processing and validation with a similar set of schemas using Databricks

Shivap
New Contributor III

We need to process hundreds of txt files in maybe 15 different formats as they arrive. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What's the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?

1 ACCEPTED SOLUTION

Accepted Solutions

intuz
Contributor II

Hey there, 

Processing many .txt files with different formats and validations is something Databricks handles well. Here's a simple approach: 

Recommended Approach: 

  • Use DLT (Delta Live Tables) or Lakeflow to build a pipeline per format (if each format maps to a specific schema/landing table). 
  • Add file validation logic (like checking header/trailer records) in your DLT pipeline using Python or SQL transforms. 
  • Use Auto Loader to watch the input folder and trigger processing as files arrive; it handles schema inference, retries, and scale very well. A minimal pipeline sketch follows this list. 
  • After loading each file into its landing table, you can update the status in the external tool (for example via a database write or API call) using foreachBatch or a simple post-write webhook/script; see the second sketch below. 
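
To make the first three bullets concrete, here is a minimal sketch of one format-specific DLT flow in Python. It assumes text files where the first character of each line marks the record type ('H' header, 'T' trailer, 'D' data); the path, table name, and record-type markers are illustrative assumptions, not your actual file spec.

```python
import dlt
from pyspark.sql import functions as F

SOURCE_PATH = "/Volumes/raw/landing/format_a/"  # hypothetical input folder for one format


@dlt.table(
    name="landing_format_a",
    comment="Data records for format A; header/trailer lines are filtered out",
)
@dlt.expect_or_drop("non_empty_line", "value IS NOT NULL AND length(value) > 0")
def landing_format_a():
    # Auto Loader (cloudFiles) discovers new files incrementally as they arrive.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")  # each line arrives as a single 'value' column
        .load(SOURCE_PATH)
        .withColumn("source_file", F.col("_metadata.file_path"))
    )

    # Assumption: the first character is the record type; keep only data records here.
    # Header/trailer reconciliation (e.g. trailer row count vs. actual row count) can be
    # a separate DLT view or expectation built on top of this table.
    return (
        raw.withColumn("record_type", F.substring("value", 1, 1))
           .filter(F.col("record_type") == "D")
    )
```

You would repeat this flow (or generate it from a config table) once per format, each pointing at its own folder and landing table.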
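
For the last bullet, the status update is easiest to see in a plain Structured Streaming job, where foreachBatch gives you a hook after each micro-batch. Everything here (endpoint URL, table name, payload shape, paths) is a hypothetical placeholder, not a real API:

```python
import requests
from pyspark.sql import functions as F

STATUS_ENDPOINT = "https://example.com/api/file-status"  # hypothetical external-tool API


def load_and_notify(batch_df, batch_id):
    # 1) Append the micro-batch to the landing table.
    batch_df.write.format("delta").mode("append").saveAsTable("landing.format_a")

    # 2) Notify the external tool once per source file seen in this batch.
    files = [r.source_file for r in batch_df.select("source_file").distinct().collect()]
    for path in files:
        requests.post(STATUS_ENDPOINT, json={"file": path, "status": "LOADED"}, timeout=10)


stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")
    .load("/Volumes/raw/landing/format_a/")
    .withColumn("source_file", F.col("_metadata.file_path"))
)

(
    stream.writeStream
    .foreachBatch(load_and_notify)
    .option("checkpointLocation", "/Volumes/raw/_checkpoints/format_a")
    .trigger(availableNow=True)  # drain whatever has arrived, then stop (good for scheduled jobs)
    .start()
)
```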

 Tips: 

  • Keep format-specific logic modular (one notebook or DLT flow per format). 
  • Track processed files using a checkpoint or metadata table to avoid duplicates; an audit-table sketch follows this list. 
  • Lakeflow is great if you prefer a visual, no-code/low-code pipeline builder, especially if several teams are collaborating. 
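
If you want an explicit metadata table on top of Auto Loader's checkpoint (for example, to drive the status reporting), a small Delta audit table updated with MERGE keeps reruns idempotent. The table and column names below are assumptions; you could call this helper from the foreachBatch function above, e.g. record_files(batch_df, "format_a").

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

spark.sql("""
    CREATE TABLE IF NOT EXISTS ops.file_audit (
        file_path    STRING,
        format_name  STRING,
        status       STRING,
        processed_at TIMESTAMP
    ) USING DELTA
""")


def record_files(batch_df, format_name):
    # One row per distinct source file in this batch; MERGE makes reruns idempotent.
    audit = (
        batch_df.select("source_file").distinct()
        .withColumn("format_name", F.lit(format_name))
        .withColumn("status", F.lit("LOADED"))
        .withColumn("processed_at", F.current_timestamp())
    )
    (
        DeltaTable.forName(spark, "ops.file_audit").alias("t")
        .merge(audit.alias("s"), "t.file_path = s.source_file")
        .whenMatchedUpdate(set={"status": "s.status", "processed_at": "s.processed_at"})
        .whenNotMatchedInsert(values={
            "file_path": "s.source_file",
            "format_name": "s.format_name",
            "status": "s.status",
            "processed_at": "s.processed_at",
        })
        .execute()
    )
```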

Hope this helps.

