<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT Potential Bug: File Reprocessing Issue with &quot;cloudFiles.allowOverwrites&quot;: &quot;tr in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97266#M39463</link>
    <description>&lt;P&gt;Apologies, that could be the internet or networking issue.&lt;/P&gt;
&lt;P&gt;In DLT you can change the DBR version, but you will have to use a custom image, which may be tricky if you have not done it before.&amp;nbsp;&lt;SPAN&gt;By default, Photon is used in serverless.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;It may be a stretch, but can you try the workload on an interactive cluster running DBR 15.3+?&lt;/P&gt;
&lt;P&gt;Or, as a workaround if Photon is involved, add these two configs to the DLT pipeline's advanced configuration:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;spark.databricks.photon.scan.enabled false
spark.databricks.photon.jsonScan.enabled false&lt;/LI-CODE&gt;
&lt;P&gt;Photon, not sure which&amp;nbsp;&lt;/P&gt;
</description>
    <pubDate>Fri, 01 Nov 2024 17:25:59 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2024-11-01T17:25:59Z</dc:date>
    <item>
      <title>DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "true"</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/96730#M39329</link>
      <description>&lt;P&gt;Hi there, I ran into a peculiar case and I'm wondering if anyone else has run into this and can offer an explanation. We have a DLT process to pull CSV files from a landing location and insert (append) them into target tables. We have the setting&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;"cloudFiles.allowOverwrites": "true"&lt;/LI-CODE&gt;&lt;P&gt;because it's possible (likely even) that a file will arrive the first time either empty or partially filled, and the same file can be overwritten with more complete data later. We are OK with data duplication&amp;nbsp;(de-duplication is handled downstream after DLT), but we are *not* OK with an updated file being skipped in a later DLT insert. Yet this &lt;STRIKE&gt;seems to be&lt;/STRIKE&gt;&amp;nbsp;is verifiably the case every now and then (roughly 2% of the time).&lt;/P&gt;&lt;P&gt;For additional context, say we have 10 daily files (sizes vary, anywhere from a few thousand to a few million records per file). These files arrive and are inserted via the DLT process initially, but we need each subsequent file update (say 3-4 times a day) to be re-inserted into the target tables via the DLT process. Once a file is inserted (or re-inserted), a separate non-DLT process is kicked off to de-duplicate the data. This all runs either on a scheduled workflow (3-4 times a day) or on a manual run.
While it is not the norm, a manual run and a scheduled run could end up running very close to or on top of each other. However, updated files fail to be re-inserted via DLT roughly 2% of the time, and I'm not sure this edge case would explain such a high failure rate.&lt;/P&gt;&lt;P&gt;Is there any known issue with the DLT process when&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;"cloudFiles.allowOverwrites": "true"&lt;/LI-CODE&gt;&lt;P&gt;?&lt;/P&gt;&lt;P&gt;Does this issue line up with any other similar reported issues with DLT?&lt;/P&gt;&lt;P&gt;Any feedback on this issue/bug would be much appreciated!&lt;/P&gt;</description>
      <pubDate>Wed, 30 Oct 2024 01:48:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/96730#M39329</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-10-30T01:48:50Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97037#M39402</link>
      <description>&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Ensure Unique Timestamps:&lt;/STRONG&gt; Verify that each file update includes a unique modification timestamp, as this can help DLT detect and reprocess updated files.&lt;/LI&gt;
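&lt;LI&gt;&lt;STRONG&gt;Use cloudFiles.validateOptions: &lt;/STRONG&gt;Keep "cloudFiles.validateOptions": "true" so that Auto Loader validates its options and raises an error for unknown or inconsistent ones. Note that this checks the option configuration, not the file contents.&lt;/LI&gt;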
&lt;LI&gt;&lt;STRONG&gt;Monitor DLT Logs:&lt;/STRONG&gt; Check the logs to confirm that DLT is detecting each file update. Any skipped file updates should log a reason, which may help pinpoint the cause.&lt;/LI&gt;
&lt;/UL&gt;
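&lt;P&gt;To illustrate the first bullet, here is a minimal sketch in plain Python of why a distinct modification timestamp matters. It is a simplified model, not Auto Loader's actual implementation: the function name, the (path, timestamp) listing, and the in-memory "seen" dictionary are all hypothetical stand-ins. An overwrite with a newer timestamp is picked up again, while an overwrite that keeps the same timestamp is skipped.&lt;/P&gt;

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a directory listing entry:
# (file path, modification time in epoch seconds).
FileEntry = Tuple[str, int]


def select_files_to_process(listing: List[FileEntry], seen: Dict[str, int]) -> List[str]:
    """Return paths that are new, or whose modification time is newer than last seen."""
    to_process = []
    for path, mtime in listing:
        last = seen.get(path)
        if last is None or mtime > last:
            to_process.append(path)
            seen[path] = mtime  # remember the newest version handled so far
    return to_process


seen: Dict[str, int] = {}
# First arrival of the file: processed.
first = select_files_to_process([("landing/a.csv", 100)], seen)
# Overwritten with a newer timestamp: processed again.
second = select_files_to_process([("landing/a.csv", 160)], seen)
# Overwritten without a timestamp change: this simple tracker skips it.
third = select_files_to_process([("landing/a.csv", 160)], seen)
print(first, second, third)
```

&lt;P&gt;In this toy model the third listing returns nothing, which is the kind of silent skip the first bullet is meant to rule out.&lt;/P&gt;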
&lt;P&gt;This combination of Auto Loader caching, file deduplication, and potential concurrency overlap could be behind the ~2% miss rate you’re seeing. Checking and adjusting these areas should help improve re-insertion consistency.&lt;/P&gt;
&lt;P&gt;If none of the above is feasible, I would suggest reaching out to our support team for a closer evaluation of the use case and of all possible options, as there are many other aspects to consider here, e.g. use of Photon, DBR release version, whether input files are mutated, the file update frequency, etc.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 16:57:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97037#M39402</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-10-31T16:57:01Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97042#M39404</link>
      <description>&lt;P&gt;Great feedback!&lt;/P&gt;&lt;P&gt;Can you please provide a bit more context or example about which &lt;U&gt;&lt;STRONG&gt;DLT logs&lt;/STRONG&gt;&lt;/U&gt; to monitor? I tried looking into logs, but likely I'm not digging in the right place and the logs I found were completely overwhelming to dig through.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 17:02:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97042#M39404</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-10-31T17:02:14Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97050#M39406</link>
      <description>&lt;P&gt;In the pipeline side panel, go to "Update Details" (on the right-hand side); the Spark logs can be seen there.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-10-31 at 10.39.16 PM.png" style="width: 875px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12445iBB01C6A71B306E96/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2024-10-31 at 10.39.16 PM.png" alt="Screenshot 2024-10-31 at 10.39.16 PM.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 17:10:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97050#M39406</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-10-31T17:10:31Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97057#M39408</link>
      <description>&lt;P&gt;At the bottom of the side panel, you can also see another "view logs" option, which shows the DLT&amp;nbsp;&lt;SPAN&gt;pipeline event log details. Click on them and a pop-up will appear.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-10-31 at 10.41.09 PM.png" style="width: 755px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12446iEEC8856C708A42E6/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2024-10-31 at 10.41.09 PM.png" alt="Screenshot 2024-10-31 at 10.41.09 PM.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-10-31 at 10.42.00 PM.png" style="width: 822px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12447iF8F5A9D41CB33C47/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2024-10-31 at 10.42.00 PM.png" alt="Screenshot 2024-10-31 at 10.42.00 PM.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-10-31 at 10.42.12 PM.png" style="width: 652px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/12448i9902DC7CD5DB1C53/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2024-10-31 at 10.42.12 PM.png" alt="Screenshot 2024-10-31 at 10.42.12 PM.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 17:13:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97057#M39408</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-10-31T17:13:07Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97059#M39410</link>
      <description>&lt;P&gt;Meanwhile, on digging further&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Auto Loader (AL) is, in general, highly recommended for ingestion where the files are immutable. While there are configurations (e.g. cloudFiles.allowOverwrites = true) that allow updated files in the source to be re-ingested, AL guarantees exactly-once semantics only when allowOverwrites is not enabled. &lt;/SPAN&gt;&lt;/P&gt;
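&lt;P&gt;Since the same file may be ingested more than once when overwrites are allowed, the downstream de-duplication step mentioned in the original post could look roughly like the sketch below. It is plain Python under stated assumptions: the Row fields (source_file, file_mtime, payload) are hypothetical, and the rule "keep only rows from the latest version of each source file" is one possible de-duplication policy, not necessarily the one used in that pipeline.&lt;/P&gt;

```python
from typing import Dict, Iterable, List, NamedTuple


class Row(NamedTuple):
    source_file: str  # hypothetical column: path of the file the row came from
    file_mtime: int   # hypothetical column: modification time of that file version
    payload: str


def keep_latest_file_version(rows: Iterable[Row]) -> List[Row]:
    """Keep only the rows that came from the newest ingested version of each file."""
    materialized = list(rows)
    latest: Dict[str, int] = {}
    for row in materialized:
        if row.file_mtime > latest.get(row.source_file, -1):
            latest[row.source_file] = row.file_mtime
    return [r for r in materialized if r.file_mtime == latest[r.source_file]]


ingested = [
    Row("a.csv", 100, "partial"),     # first, partially filled version
    Row("a.csv", 160, "complete-1"),  # overwritten, more complete version
    Row("a.csv", 160, "complete-2"),
    Row("b.csv", 120, "only"),
]
deduplicated = keep_latest_file_version(ingested)
print([r.payload for r in deduplicated])
```

&lt;P&gt;A policy like this only helps when the updated file was actually re-ingested; it cannot recover an update that was skipped upstream.&lt;/P&gt;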
&lt;P&gt;&lt;SPAN&gt;Can you please help me with two details&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;- Which DBR are you on? (Please try with the latest.)&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;- Are you using Photon?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 31 Oct 2024 17:17:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97059#M39410</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-10-31T17:17:55Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97237#M39461</link>
      <description>&lt;P&gt;Hi.. I've tried to respond 3 times already but there seems to be an issue with DBX Community and each time my post shows as successful, and I refresh the page and it looks good... but then I check in later (e.g. 30 mins later) and my post is GONE! ...&lt;/P&gt;&lt;P&gt;I have more context, but for now to answer your direct questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;dlt:14.1.21-delta-pipelines-dlt-release-2024.40-rc0-commit-b9997c9-image-963030d&lt;/LI&gt;&lt;LI&gt;Not using Photon&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Fri, 01 Nov 2024 14:51:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97237#M39461</guid>
      <dc:creator>ChristianRRL</dc:creator>
      <dc:date>2024-11-01T14:51:43Z</dc:date>
    </item>
    <item>
      <title>Re: DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "tr</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97266#M39463</link>
      <description>&lt;P&gt;Apologies, that could be the internet or networking issue.&lt;/P&gt;
&lt;P&gt;In DLT you can change the DBR version, but you will have to use a custom image, which may be tricky if you have not done it before.&amp;nbsp;&lt;SPAN&gt;By default, Photon is used in serverless.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;It may be a stretch, but can you try the workload on an interactive cluster running DBR 15.3+?&lt;/P&gt;
&lt;P&gt;Or, as a workaround if Photon is involved, add these two configs to the DLT pipeline's advanced configuration:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;spark.databricks.photon.scan.enabled false
spark.databricks.photon.jsonScan.enabled false&lt;/LI-CODE&gt;
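&lt;P&gt;If you edit the pipeline settings as JSON rather than through the UI, the same two keys would typically go into the settings' "configuration" map. The short Python sketch below just builds and prints that fragment; the variable name is hypothetical, and you should check the exact placement against your own pipeline's settings before relying on it.&lt;/P&gt;

```python
import json

# Fragment of DLT pipeline settings; only the "configuration" map is shown.
# The two keys and their values are the ones quoted above.
settings_fragment = {
    "configuration": {
        "spark.databricks.photon.scan.enabled": "false",
        "spark.databricks.photon.jsonScan.enabled": "false",
    }
}

# Print the fragment as JSON so it can be compared with the pipeline's settings.
print(json.dumps(settings_fragment, indent=2))
```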
&lt;P&gt;Photon, not sure which&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Fri, 01 Nov 2024 17:25:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-potential-bug-file-reprocessing-issue-with-quot-cloudfiles/m-p/97266#M39463</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2024-11-01T17:25:59Z</dc:date>
    </item>
  </channel>
</rss>

