
Data processing and validation with a similar set of schemas using Databricks

Shivap
New Contributor III

We need to process hundreds of txt files in maybe 15 different formats as they arrive. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What's the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?

1 ACCEPTED SOLUTION

Accepted Solutions

intuz
Contributor II

Hey there, 

Processing many .txt files with different formats and validations is something Databricks handles well. Here's a simple approach: 

Recommended Approach: 

  • Use DLT (Delta Live Tables) or Lakeflow to build a pipeline per format (if each format maps to a specific schema/landing table). 
  • Add file validation logic (like checking header/trailer records) in your DLT pipeline using Python or SQL transforms. 
  • Use Auto Loader to watch the input folder and trigger processing as files arrive; it handles schema inference, retries, and scale very well. A minimal pipeline sketch follows this list. 
  • After loading each file into its landing table, you can update the status in the external tool (for example via a database write or API call) using foreachBatch or a simple post-write webhook/script; see the second sketch below. 
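
To make the first three bullets concrete, here is a minimal sketch of one format-specific DLT flow in Python. It assumes text files where the first character of each line marks the record type ('H' header, 'T' trailer, 'D' data); the path, table name, and record-type markers are illustrative assumptions, not your actual file spec.

```python
import dlt
from pyspark.sql import functions as F

SOURCE_PATH = "/Volumes/raw/landing/format_a/"  # hypothetical input folder for one format


@dlt.table(
    name="landing_format_a",
    comment="Data records for format A; header/trailer lines are filtered out",
)
@dlt.expect_or_drop("non_empty_line", "value IS NOT NULL AND length(value) > 0")
def landing_format_a():
    # Auto Loader (cloudFiles) discovers new files incrementally as they arrive.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")  # each line arrives as a single 'value' column
        .load(SOURCE_PATH)
        .withColumn("source_file", F.col("_metadata.file_path"))
    )

    # Assumption: the first character is the record type; keep only data records here.
    # Header/trailer reconciliation (e.g. trailer row count vs. actual row count) can be
    # a separate DLT view or expectation built on top of this table.
    return (
        raw.withColumn("record_type", F.substring("value", 1, 1))
           .filter(F.col("record_type") == "D")
    )
```

You would repeat this flow (or generate it from a config table) once per format, each pointing at its own folder and landing table.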
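
For the last bullet, the status update is easiest to see in a plain Structured Streaming job, where foreachBatch gives you a hook after each micro-batch. Everything here (endpoint URL, table name, payload shape, paths) is a hypothetical placeholder, not a real API:

```python
import requests
from pyspark.sql import functions as F

STATUS_ENDPOINT = "https://example.com/api/file-status"  # hypothetical external-tool API


def load_and_notify(batch_df, batch_id):
    # 1) Append the micro-batch to the landing table.
    batch_df.write.format("delta").mode("append").saveAsTable("landing.format_a")

    # 2) Notify the external tool once per source file seen in this batch.
    files = [r.source_file for r in batch_df.select("source_file").distinct().collect()]
    for path in files:
        requests.post(STATUS_ENDPOINT, json={"file": path, "status": "LOADED"}, timeout=10)


stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")
    .load("/Volumes/raw/landing/format_a/")
    .withColumn("source_file", F.col("_metadata.file_path"))
)

(
    stream.writeStream
    .foreachBatch(load_and_notify)
    .option("checkpointLocation", "/Volumes/raw/_checkpoints/format_a")
    .trigger(availableNow=True)  # drain whatever has arrived, then stop (good for scheduled jobs)
    .start()
)
```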

 Tips: 

  • Keep format-specific logic modular (one notebook or DLT flow per format). 
  • Track processed files using a checkpoint or metadata table to avoid duplicates; an audit-table sketch follows this list. 
  • Lakeflow is great if you prefer a visual, no-code/low-code pipeline builder, especially if several teams are collaborating. 
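
If you want an explicit metadata table on top of Auto Loader's checkpoint (for example, to drive the status reporting), a small Delta audit table updated with MERGE keeps reruns idempotent. The table and column names below are assumptions; you could call this helper from the foreachBatch function above, e.g. record_files(batch_df, "format_a").

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

spark.sql("""
    CREATE TABLE IF NOT EXISTS ops.file_audit (
        file_path    STRING,
        format_name  STRING,
        status       STRING,
        processed_at TIMESTAMP
    ) USING DELTA
""")


def record_files(batch_df, format_name):
    # One row per distinct source file in this batch; MERGE makes reruns idempotent.
    audit = (
        batch_df.select("source_file").distinct()
        .withColumn("format_name", F.lit(format_name))
        .withColumn("status", F.lit("LOADED"))
        .withColumn("processed_at", F.current_timestamp())
    )
    (
        DeltaTable.forName(spark, "ops.file_audit").alias("t")
        .merge(audit.alias("s"), "t.file_path = s.source_file")
        .whenMatchedUpdate(set={"status": "s.status", "processed_at": "s.processed_at"})
        .whenNotMatchedInsert(values={
            "file_path": "s.source_file",
            "format_name": "s.format_name",
            "status": "s.status",
            "processed_at": "s.processed_at",
        })
        .execute()
    )
```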

Hope this helps.

