<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131582#M49149</link>
    <description>&lt;P&gt;Community thread on why Auto Loader's includeExistingFiles and modifiedAfter options are not skipping previously processed files, and how checkpoint locations, triggers, and timestamp formats affect them.&lt;/P&gt;</description>
    <pubDate>Wed, 10 Sep 2025 18:54:34 GMT</pubDate>
    <dc:creator>ManojkMohan</dc:creator>
    <dc:date>2025-09-10T18:54:34Z</dc:date>
    <item>
      <title>AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131477#M49097</link>
      <description>&lt;P&gt;I'm using this code to read data from an ADLS Gen2 location. There are txt files present in sub-folders of the container.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;    df_stream = spark.readStream \
        .format("cloudFiles") \
        .option("cloudFiles.format", "text") \
        .option('cloudFiles.includeExistingFiles', "false") \
        .option('cloudFiles.modifiedAfter', '2025-09-09 00:00:00.000000 UTC+0') \
        .format("text") \
        .load(LANDED_PATH) &lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to skip all the old files in this location, since they have already been processed. I followed the documentation and used the&amp;nbsp;includeExistingFiles&amp;nbsp;and&amp;nbsp;modifiedAfter&amp;nbsp;options, but they aren't working: the old files are still getting processed. Why are these options not working?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 05:45:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131477#M49097</guid>
      <dc:creator>tabinashabir</dc:creator>
      <dc:date>2025-09-10T05:45:52Z</dc:date>
    </item>
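One detail worth double-checking in the snippet above: .format(...) appears twice, and in Spark's DataStreamReader each call to format() replaces the previously set source, so the trailing .format("text") would override .format("cloudFiles") and the cloudFiles.* options would likely be ignored. Here is a toy builder (not Spark, just an illustration of the last-call-wins behavior):

```python
class FakeStreamReader:
    """Toy stand-in for spark.readStream, only to illustrate that each
    .format() call replaces the previous one (as in Spark's
    DataStreamReader), so a trailing .format("text") undoes
    .format("cloudFiles") and the cloudFiles.* options go unused."""
    def __init__(self):
        self.source = None
        self.options = {}

    def format(self, source):
        self.source = source  # overwrites whatever was set before
        return self

    def option(self, key, value):
        self.options[key] = value
        return self

reader = (FakeStreamReader()
    .format("cloudFiles")
    .option("cloudFiles.includeExistingFiles", "false")
    .format("text"))   # second call wins, mirroring the snippet above
print(reader.source)   # text
```

With "text" as the effective source, the stream is a plain text-file stream rather than an Auto Loader stream, which would explain why both options appear to do nothing.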
    <item>
      <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131482#M49100</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/183834"&gt;@tabinashabir&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I think one explanation for why&amp;nbsp;includeExistingFiles doesn't work is that &lt;SPAN&gt;this option is &lt;STRONG&gt;evaluated only when you start a stream for the first time&lt;/STRONG&gt;. Changing this option after restarting the stream has no effect.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;But it's odd that modifiedAfter doesn't work in your case. Maybe try experimenting with a different timestamp string,&amp;nbsp;&lt;BR /&gt;something like this Medium article suggests:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1757486357980.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19883iE9495C374AC538F4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1757486357980.png" alt="szymon_dybczak_0-1757486357980.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 06:39:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131482#M49100</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-10T06:39:54Z</dc:date>
    </item>
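The timestamp-format advice in the replies can be sanity-checked without Spark. This is a minimal sketch; the helper name is made up, and Auto Loader may accept more formats than this strict ISO 8601 check does:

```python
from datetime import datetime

def is_valid_modified_after(ts: str) -> bool:
    """Return True if ts parses as ISO 8601 with timezone info,
    the shape recommended for cloudFiles.modifiedAfter."""
    try:
        # %f allows fractional seconds; %z accepts "Z" on Python 3.7+
        parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    except ValueError:
        return False
    return parsed.tzinfo is not None

print(is_valid_modified_after("2025-09-09T00:00:00.000Z"))          # True
print(is_valid_modified_after("2025-09-09 00:00:00.000000 UTC+0"))  # False
```

Note that the string from the original question fails this strict check, while the ISO 8601 form recommended in the replies passes.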
    <item>
      <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131483#M49101</link>
      <description>&lt;P&gt;Root cause:&lt;/P&gt;&lt;P&gt;includeExistingFiles is evaluated only the first time the stream is started with a fresh checkpoint. If the stream is restarted or the checkpoint folder is reused, changing this option has no effect on subsequent runs, so files that were previously seen (or never tracked) can be reprocessed.&lt;/P&gt;&lt;P&gt;To reliably skip existing files, you must use a new (clean) checkpoint location when starting the stream. Otherwise, Auto Loader refers to the checkpoint's metadata of already-processed files and to the option values captured at the stream's initialization.&lt;/P&gt;&lt;P&gt;Solution:&lt;/P&gt;&lt;P&gt;1. Identify your container path and set a clean checkpoint location&lt;BR /&gt;Determine the correct storage path (e.g., for ADLS Gen2: abfss://&amp;lt;container&amp;gt;@&amp;lt;account&amp;gt;.dfs.core.windows.net/&amp;lt;path&amp;gt;).&lt;/P&gt;&lt;P&gt;Choose a brand-new checkpoint location that Auto Loader has never used, for example abfss://&amp;lt;container&amp;gt;@&amp;lt;account&amp;gt;.dfs.core.windows.net/&amp;lt;path&amp;gt;/checkpoints/autoloader_run1.&lt;/P&gt;&lt;P&gt;Old checkpoint locations must not be reused; they store processed-file metadata.&lt;/P&gt;&lt;P&gt;2. Set the key options in the stream reader&lt;BR /&gt;Use .option("cloudFiles.includeExistingFiles", "false") to exclude files present before the stream starts.&lt;/P&gt;&lt;P&gt;(Optional) Add .option("cloudFiles.modifiedAfter", "&amp;lt;timestamp&amp;gt;") if you know new files will have a modified time after the specified value.&lt;/P&gt;&lt;P&gt;Specify the correct cloudFiles.format (e.g., "text" for txt files).&lt;/P&gt;&lt;P&gt;3. Start the stream (Python example)&lt;BR /&gt;Replace LANDED_PATH, CHECKPOINT_PATH, and OUTPUT_PATH with your paths.&lt;/P&gt;&lt;P&gt;python&lt;BR /&gt;df_stream = (spark.readStream&lt;BR /&gt;.format("cloudFiles")&lt;BR /&gt;.option("cloudFiles.format", "text")&lt;BR /&gt;.option("cloudFiles.includeExistingFiles", "false")&lt;BR /&gt;.option("cloudFiles.modifiedAfter", "2025-09-09T00:00:00.000Z")  # optional, for fine control&lt;BR /&gt;.load(LANDED_PATH))&lt;/P&gt;&lt;P&gt;query = (df_stream.writeStream&lt;BR /&gt;.format("delta")&lt;BR /&gt;.option("checkpointLocation", CHECKPOINT_PATH)&lt;BR /&gt;.start(OUTPUT_PATH))&lt;BR /&gt;Do not reuse any checkpoint directories from previous runs.&lt;/P&gt;&lt;P&gt;4. Validate file processing&lt;BR /&gt;Upload a test file to your source folder. Only files that arrive after the streaming job starts will be ingested.&lt;/P&gt;&lt;P&gt;Old files will be skipped unless their lastModified attribute is after the timestamp specified in cloudFiles.modifiedAfter.&lt;/P&gt;&lt;P&gt;5. Forcing exclusion of old files&lt;BR /&gt;If any old file does get processed, re-check that your checkpoint is new and that the file's lastModified attribute is correct.&lt;/P&gt;&lt;P&gt;If re-running or reconfiguring, always delete the checkpoint directory or provide a new one.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Let me know if it works.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 06:42:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131483#M49101</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-10T06:42:57Z</dc:date>
    </item>
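Step 1 above (never reuse a checkpoint) can be automated by generating a unique checkpoint path per run. A small sketch; the base path and naming scheme are illustrative, not an official API:

```python
import uuid
from datetime import datetime, timezone

def fresh_checkpoint_path(base: str) -> str:
    """Build a checkpoint path Auto Loader has never seen, so
    includeExistingFiles is evaluated as a first-run option.
    The run-id plus a random suffix keeps each run's path unique."""
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    suffix = uuid.uuid4().hex[:8]
    return f"{base.rstrip('/')}/checkpoints/autoloader_{run_id}_{suffix}"

# Container/account names here are placeholders.
path = fresh_checkpoint_path("abfss://container@account.dfs.core.windows.net/landing")
print(path)
```

Passing the result as checkpointLocation guarantees the "brand-new checkpoint" precondition on every run; the trade-off is that you lose exactly-once resumption across restarts, so use it only when a fresh start is actually intended.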
    <item>
      <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131580#M49147</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Thanks for your reply.&lt;BR /&gt;&lt;BR /&gt;I did try multiple date formats supported in Azure Databricks. I also tried using&amp;nbsp;&lt;BR /&gt;.option('cloudFiles.includeExistingFiles', False). It still didn't work.&lt;BR /&gt;&lt;BR /&gt;I cleared the checkpoints each time.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 18:37:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131580#M49147</guid>
      <dc:creator>tabinashabir</dc:creator>
      <dc:date>2025-09-10T18:37:07Z</dc:date>
    </item>
    <item>
      <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131581#M49148</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Thanks for your reply.&lt;BR /&gt;&lt;BR /&gt;The checkpoint location is set to a new path. I tried multiple approaches and cleared the checkpoint location each time so it's considered a first run. I followed all the steps you've mentioned above. It is still processing all the old files.&lt;BR /&gt;&lt;BR /&gt;The only difference I see is that I'm using a trigger in the write stream.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df_stream.writeStream \
.format("delta") \
.trigger(availableNow=True) \
.option("checkpointLocation", CHECKPOINT_PATH) \
.start(OUTPUT_PATH)&lt;/LI-CODE&gt;&lt;P&gt;I did try multiple date formats supported in Azure Databricks. I also tried using&amp;nbsp;&lt;BR /&gt;.option('cloudFiles.includeExistingFiles', False). It still didn't work.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 18:44:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131581#M49148</guid>
      <dc:creator>tabinashabir</dc:creator>
      <dc:date>2025-09-10T18:44:03Z</dc:date>
    </item>
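For intuition about what a modifiedAfter cutoff compares, here is a plain-Python analogue that filters a local directory by last-modified time. It only illustrates the comparison; Auto Loader's real implementation tracks files via checkpoint state against cloud-storage metadata:

```python
import os
import tempfile
from datetime import datetime, timezone

def files_modified_after(directory: str, cutoff: datetime) -> list:
    """Illustrative only: mimic a modifiedAfter cutoff by comparing
    each file's last-modified time against a timezone-aware cutoff."""
    selected = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        mtime = datetime.fromtimestamp(os.path.getmtime(full), tz=timezone.utc)
        if mtime > cutoff:
            selected.append(name)
    return selected

with tempfile.TemporaryDirectory() as d:
    old_file = os.path.join(d, "old.txt")
    with open(old_file, "w") as f:
        f.write("already processed")
    # Back-date the file to late 2023 so a 2025 cutoff excludes it.
    os.utime(old_file, (1_700_000_000, 1_700_000_000))
    cutoff = datetime(2025, 9, 9, tzinfo=timezone.utc)
    print(files_modified_after(d, cutoff))  # []
```

If an "old" file still gets picked up, this is the comparison to audit: copying or re-uploading a file typically resets its last-modified timestamp, making it look new to the cutoff.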
    <item>
      <title>Re: AutoLoader options includeExistingFiles and modifiedAfter not working</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131582#M49149</link>
      <description>&lt;P&gt;Always use a brand-new, clean checkpoint location when starting your stream if you want to skip existing files.&lt;/P&gt;&lt;P&gt;Example checkpoint path: abfss://&amp;lt;container&amp;gt;@&amp;lt;account&amp;gt;.dfs.core.windows.net/&amp;lt;path&amp;gt;/checkpoints/autoloader_run1&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Avoid .trigger(availableNow=True) if you want continuous incremental processing.&lt;/P&gt;&lt;P&gt;Prefer the default micro-batch trigger, which processes new files incrementally and respects includeExistingFiles=false and the checkpoint metadata.&lt;/P&gt;&lt;P&gt;Validate and format the cloudFiles.modifiedAfter timestamp correctly.&lt;/P&gt;&lt;P&gt;Use ISO 8601 format with timezone info, e.g., 2025-09-09T00:00:00.000Z.&lt;/P&gt;&lt;P&gt;Ensure the files to be skipped have last-modified timestamps that actually match your filter criteria.&lt;/P&gt;&lt;P&gt;Verify the source folder contents.&lt;/P&gt;&lt;P&gt;Remove or archive old files from the source directory if possible.&lt;/P&gt;&lt;P&gt;Make sure directory listing or file notifications are configured properly for your cloud storage source.&lt;/P&gt;&lt;P&gt;Ensure the stream is configured with these options:&lt;/P&gt;&lt;P&gt;python&lt;BR /&gt;df_stream = (spark.readStream&lt;BR /&gt;.format("cloudFiles")&lt;BR /&gt;.option("cloudFiles.format", "text")&lt;BR /&gt;.option("cloudFiles.includeExistingFiles", "false")&lt;BR /&gt;.option("cloudFiles.modifiedAfter", "2025-09-09T00:00:00.000Z")  # proper ISO 8601 format&lt;BR /&gt;.load(LANDED_PATH))&lt;/P&gt;&lt;P&gt;query = (df_stream.writeStream&lt;BR /&gt;.format("delta")&lt;BR /&gt;.option("checkpointLocation", NEW_CHECKPOINT_PATH)  # brand-new checkpoint path&lt;BR /&gt;.start(OUTPUT_PATH))&lt;BR /&gt;If old files keep getting processed, delete and recreate the checkpoint folder, or use a new unique checkpoint location on each run.&lt;/P&gt;&lt;P&gt;Use Auto Loader metrics and logs to monitor which files get processed and identify unexpected behavior.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 18:54:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-options-includeexistingfiles-and-modifiedafter-not/m-p/131582#M49149</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-10T18:54:34Z</dc:date>
    </item>
  </channel>
</rss>

