<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data processing and validation with similar set of schema using databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</link>
    <description>&lt;P&gt;We need to process hundreds of .txt files in roughly 15 different formats, depending on file arrival. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What’s the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?&lt;/P&gt;</description>
    <pubDate>Tue, 08 Jul 2025 13:49:12 GMT</pubDate>
    <dc:creator>Shivap</dc:creator>
    <dc:date>2025-07-08T13:49:12Z</dc:date>
    <item>
      <title>Data processing and validation with similar set of schema using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</link>
      <description>&lt;P&gt;We need to process hundreds of .txt files in roughly 15 different formats, depending on file arrival. We need to do basic file validation (header/trailer) before loading them into roughly 15 landing tables. What’s the best way to process them into landing tables using Databricks? I am thinking of using DLT/Lakeflow for each format. We also need to update the status in a different tool as the data gets processed. Do you have any suggestions?&lt;/P&gt;</description>
      <pubDate>Tue, 08 Jul 2025 13:49:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124462#M47201</guid>
      <dc:creator>Shivap</dc:creator>
      <dc:date>2025-07-08T13:49:12Z</dc:date>
    </item>
    <item>
      <title>Re: Data processing and validation with similar set of schema using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124537#M47222</link>
      <description>&lt;P&gt;Hey there,&lt;/P&gt;&lt;P&gt;Processing many .txt files with different formats and validations is something Databricks handles well. Here’s a simple approach:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Recommended approach:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use DLT (Delta Live Tables) or Lakeflow to build a pipeline per format (if each format maps to a specific schema/landing table).&lt;/LI&gt;&lt;LI&gt;Add file-validation logic (such as header/trailer checks) to your DLT pipeline using Python or SQL transforms.&lt;/LI&gt;&lt;LI&gt;Use Auto Loader to watch the input folder and trigger processing as files arrive; it handles schema inference, retries, and scale very well.&lt;/LI&gt;&lt;LI&gt;After loading each file into its landing table, update the status in an external tool (via a database write or API call) using foreachBatch or a simple post-write webhook/script.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Tips:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Keep format-specific logic modular (one notebook or DLT flow per format).&lt;/LI&gt;&lt;LI&gt;Track processed files with a checkpoint or metadata table to avoid duplicates.&lt;/LI&gt;&lt;LI&gt;Lakeflow is great if you prefer a visual, low-code pipeline builder, especially when teams are collaborating.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Jul 2025 06:28:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-processing-and-validation-with-similar-set-of-schema-using/m-p/124537#M47222</guid>
      <dc:creator>intuz</dc:creator>
      <dc:date>2025-07-09T06:28:55Z</dc:date>
    </item>
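    <!-- Editor's note: the header/trailer validation described in the reply can be sketched in plain Python before files are handed to an Auto Loader/DLT pipeline. This is a minimal sketch under an assumed, hypothetical file layout: the first line starts with HDR, the last line starts with TRL, and the trailer carries the expected record count after a pipe (e.g. TRL|2). Real formats will differ per landing table. -->

```python
def validate_file(lines, header_prefix="HDR", trailer_prefix="TRL"):
    """Basic header/trailer validation for a fixed-format text file.

    Assumes a hypothetical layout: the first line starts with
    ``header_prefix``, the last line starts with ``trailer_prefix`` and
    carries the expected record count after a pipe, e.g. ``TRL|2``.
    Returns True only when both markers are present and the count
    matches the number of data records in between.
    """
    if len(lines) < 2:
        return False
    if not lines[0].startswith(header_prefix):
        return False
    if not lines[-1].startswith(trailer_prefix):
        return False
    try:
        expected = int(lines[-1].split("|")[1])
    except (IndexError, ValueError):
        # Trailer present but count missing or non-numeric.
        return False
    # Data records sit between the header and the trailer.
    return expected == len(lines) - 2
```

    <!-- In a pipeline, a check like this could route failing files to a quarantine path before Auto Loader ingests the rest, or be expressed instead as DLT expectations on the loaded rows. -->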
  </channel>
</rss>

