<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Live Table Path/Directory help in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89522#M37841</link>
    <description>&lt;P&gt;Do I understand correctly that the requirement is to exclude the files from today? &lt;STRONG&gt;Yes, exactly.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In the code shared you say you are doing .load(blobName+"/"+FolderName+"/*.json"), so it should pick up all the files that were not processed previously. This is the problem: it doesn't process all files.&lt;/P&gt;&lt;P&gt;So, what I am trying to process are the files that were excluded from yesterday's run. For example, today (September 11th), the pipeline should process all files up to yesterday, excluding today's files since they're incomplete. Tomorrow (September 12th), the pipeline should process all files through September 11th, including the files that were excluded on the 11th, but still exclude files from the 12th. In essence, the pipeline should always exclude files from the current run day but can process any files prior to that day.&lt;/P&gt;</description>
    <pubDate>Wed, 11 Sep 2024 21:24:07 GMT</pubDate>
    <dc:creator>standup1</dc:creator>
    <dc:date>2024-09-11T21:24:07Z</dc:date>
    <item>
      <title>Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89514#M37837</link>
      <description>&lt;P&gt;Hello, I am working on a DLT pipeline and have run into an issue. I hope someone here can help me find a solution.&lt;BR /&gt;My files are JSON in Azure storage. They are stored in a directory structure like this: blobName/FolderName/xx.json.&lt;BR /&gt;The folder names look like this: 2024-08-20T12.00.00Z. We receive these files all day, and the folder name changes each time. The problem is that when I run the pipeline, I need to exclude today's files.&lt;/P&gt;&lt;P&gt;Here's part of my script:&lt;/P&gt;&lt;P&gt;.load(blobName+"/"+FolderName+"/*.json"), so if I run it today, I will do .load(blobName+"/"+"2024-09-11*Z"+"/*.json"). This works fine and picks up all files from 2024-09-11. The problem is the following day. When I change it to .load(blobName+"/"+"2024-09-12*Z"+"/*.json"), it doesn't pick up all the files. I think it only picks up the new files that arrived after yesterday's run time. Am I doing something wrong with the "*" wildcards in my path? I appreciate any help.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2024 19:33:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89514#M37837</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-09-11T19:33:21Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89518#M37838</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102950"&gt;@standup1&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Auto Loader uses a checkpointing mechanism to track the files that have been processed. When you run your DLT pipeline, Auto Loader remembers the state of the files processed during the last run and will not reprocess them.&lt;/P&gt;&lt;P&gt;Do you really want to reprocess the existing files from previous days?&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2024 19:44:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89518#M37838</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-11T19:44:43Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89519#M37839</link>
      <description>&lt;P&gt;Thanks for your reply. That makes a lot of sense. So it looks like DLT scans those files but doesn't load them into the DataFrame. The next day, when I run it to pick up the previous day's data, it only brings in some of them. I need to reprocess those files, but only from yesterday. Do you know if there's any workaround? I tried option("cloudFiles.includeExistingFiles","true"), but that didn't do it.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2024 19:56:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89519#M37839</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-09-11T19:56:52Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89521#M37840</link>
      <description>&lt;P&gt;option("cloudFiles.includeExistingFiles","true") will not work.&lt;/P&gt;&lt;P&gt;This option controls whether files already in the folder are processed when Auto Loader starts, so it matters only on the first run of Auto Loader, and it defaults to true.&lt;/P&gt;&lt;P&gt;Could you explain once again what you are trying to achieve?&lt;/P&gt;&lt;P&gt;Why do you want to reprocess files that were already processed?&lt;/P&gt;&lt;P&gt;Do I understand correctly that the requirement is to exclude the files from today?&lt;/P&gt;&lt;P&gt;In the code you shared, you say you are doing .load(blobName+"/"+FolderName+"/*.json"), so it should pick up all the files that were not processed previously.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2024 20:29:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89521#M37840</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-11T20:29:04Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89522#M37841</link>
      <description>&lt;P&gt;Do I understand correctly that the requirement is to exclude the files from today? &lt;STRONG&gt;Yes, exactly.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In the code shared you say you are doing .load(blobName+"/"+FolderName+"/*.json"), so it should pick up all the files that were not processed previously. This is the problem: it doesn't process all files.&lt;/P&gt;&lt;P&gt;So, what I am trying to process are the files that were excluded from yesterday's run. For example, today (September 11th), the pipeline should process all files up to yesterday, excluding today's files since they're incomplete. Tomorrow (September 12th), the pipeline should process all files through September 11th, including the files that were excluded on the 11th, but still exclude files from the 12th. In essence, the pipeline should always exclude files from the current run day but can process any files prior to that day.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Sep 2024 21:24:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89522#M37841</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-09-11T21:24:07Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89552#M37853</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102950"&gt;@standup1&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;To do this, calculate today_date, add a file_date column to the DataFrame, and then keep only the records where file_date is earlier than today_date:&lt;/P&gt;&lt;P&gt;My folders:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1726124068229.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11096iB71B88733323DF5C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1726124068229.png" alt="filipniziol_0-1726124068229.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The code to get the files, excluding today's folders, based on the folder name:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1726124195128.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11097i14DAEE1D6F441CDE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1726124195128.png" alt="filipniziol_1-1726124195128.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_2-1726124354828.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11098i87FE473B3BF5E9A1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_2-1726124354828.png" alt="filipniziol_2-1726124354828.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Hope it helps.&lt;/P&gt;</description>
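The reply above shares its code only as screenshots, so what follows is a minimal sketch of the same idea in plain Python, not the author's actual code. It assumes folder names follow the 2024-08-20T12.00.00Z pattern from the original post. The helper names (folder_date, exclude_today), the regular expression, and the sample paths are hypothetical and exist only for illustration.

```python
import re
from datetime import date
from typing import List, Optional

# Assumption: the folder name starts with an ISO date, then "T", a time, and "Z",
# e.g. blobName/2024-08-20T12.00.00Z/xx.json (pattern taken from the thread).
FOLDER_DATE = re.compile(r"/(\d{4}-\d{2}-\d{2})T[^/]*Z/")


def folder_date(path: str) -> Optional[date]:
    """Return the date encoded in the timestamped folder name, or None."""
    match = FOLDER_DATE.search(path)
    return date.fromisoformat(match.group(1)) if match else None


def exclude_today(paths: List[str], today: date) -> List[str]:
    """Keep only the files whose folder date is strictly before `today`."""
    kept = []
    for path in paths:
        file_date = folder_date(path)
        # Paths without a recognizable folder date are skipped.
        if file_date is not None and today > file_date:
            kept.append(path)
    return kept


if __name__ == "__main__":
    # Hypothetical sample paths, not taken from the thread.
    sample = [
        "blobName/2024-09-10T08.00.00Z/a.json",
        "blobName/2024-09-11T09.30.00Z/b.json",
    ]
    print(exclude_today(sample, date(2024, 9, 11)))
```

In a pipeline, the same comparison would presumably be applied to a file_date column derived from each file's path, with the load path left broad enough to match every day's folders, so that Auto Loader's checkpoint decides which files are new and the filter decides which days are complete.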
      <pubDate>Thu, 12 Sep 2024 07:00:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89552#M37853</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-12T07:00:18Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89653#M37878</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117376"&gt;@filipniziol&lt;/a&gt; ,&lt;/P&gt;&lt;P&gt;Thank you so much for sharing this example with me. This is very helpful. I appreciate your help.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Sep 2024 15:12:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89653#M37878</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-09-12T15:12:56Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table Path/Directory help</title>
      <link>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89654#M37879</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102950"&gt;@standup1&lt;/a&gt;&amp;nbsp;, I'm glad the example was helpful&lt;/P&gt;</description>
      <pubDate>Thu, 12 Sep 2024 15:26:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delt-live-table-path-directory-help/m-p/89654#M37879</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-12T15:26:29Z</dc:date>
    </item>
  </channel>
</rss>

