<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic duplicate files in delta table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/duplicate-files-in-delta-table/m-p/61257#M31745</link>
    <description>&lt;P&gt;&lt;SPAN&gt;I have been facing this issue for a long time, and so far I have found no solution. I have a Delta table, and my bronze layer randomly picks up old files (mostly files that are 8 days old). The source of my files is Azure Blob Storage.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 20 Feb 2024 13:10:42 GMT</pubDate>
    <dc:creator>jaimeperry12345</dc:creator>
    <dc:date>2024-02-20T13:10:42Z</dc:date>
    <item>
      <title>duplicate files in delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/duplicate-files-in-delta-table/m-p/61257#M31745</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I have been facing this issue for a long time, and so far I have found no solution. I have a Delta table, and my bronze layer randomly picks up old files (mostly files that are 8 days old). The source of my files is Azure Blob Storage.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Feb 2024 13:10:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/duplicate-files-in-delta-table/m-p/61257#M31745</guid>
      <dc:creator>jaimeperry12345</dc:creator>
      <dc:date>2024-02-20T13:10:42Z</dc:date>
    </item>
    <item>
      <title>Re: duplicate files in delta table</title>
      <link>https://community.databricks.com/t5/data-engineering/duplicate-files-in-delta-table/m-p/61301#M31756</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99998"&gt;@jaimeperry12345&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will need more information to point you in the right direction:&amp;nbsp;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Confirm the behavior:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Double-check that your Delta table is indeed ingesting 8-day-old files at random.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;Please share any logs or error messages you have about this.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Expected behavior:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Explain how the table should ideally behave.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;Do you expect it to pick up only the latest files?&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;SPAN&gt;Based on the details you have shared so far, please check:&lt;/SPAN&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Check file timestamps:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Verify that the file timestamps in Azure Blob Storage accurately reflect the actual creation time.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;Inconsistent timestamps can mislead Auto Loader.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Review Auto Loader configuration:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Ensure your Auto Loader configuration points to the correct directory and sets parameters like&amp;nbsp;&lt;/SPAN&gt;minPartitions&lt;SPAN&gt;&amp;nbsp;and&amp;nbsp;&lt;/SPAN&gt;partitionBy&lt;SPAN&gt;&amp;nbsp;appropriately.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Spark configuration:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Make sure your Spark session configuration doesn't have any settings that might interfere with reading the latest files (e.g.,&amp;nbsp;caching or checkpointing).&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cluster termination:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;If you're using a managed Databricks cluster,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;ensure it's not automatically terminating and restarting,&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;as this can sometimes cause Auto Loader to pick up older files.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Logs and diagnostics:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;Analyze the Delta Lake logs and Spark driver logs for any clues about what might be causing the issue.&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;There may be specific error messages or warnings related to Auto Loader.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Follow-ups are appreciated!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Feb 2024 01:37:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/duplicate-files-in-delta-table/m-p/61301#M31756</guid>
      <dc:creator>Palash01</dc:creator>
      <dc:date>2024-02-21T01:37:20Z</dc:date>
    </item>
  </channel>
</rss>

