<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Multiple Autoloader reading the same directory path in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/61627#M31816</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are receiving around 6k files every hour, or 99 files per minute, and these files vary in size.&lt;/P&gt;&lt;P&gt;One thing I also noticed is that the&amp;nbsp;&lt;SPAN&gt;Scheduler Delay seems to take too long, around 1 to 2 hours.&lt;BR /&gt;&lt;BR /&gt;We are already using ADLS Gen2, the Bronze tables are in Delta format, and we are not using any schema inference, so I am not sure what is going on in our DLT pipeline.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Feb 2024 01:27:51 GMT</pubDate>
    <dc:creator>Gilg</dc:creator>
    <dc:date>2024-02-23T01:27:51Z</dc:date>
    <item>
      <title>Multiple Autoloader reading the same directory path</title>
      <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60107#M31590</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;Originally, I had only 1 pipeline reading from a directory. Now, as a test, I cloned the existing pipeline and edited the settings to point to a different catalog. Both pipelines are now reading the same directory path and running in continuous mode.&lt;/P&gt;&lt;P&gt;Question:&lt;/P&gt;&lt;P&gt;Does this create file locks when pipeline 1 reads these files using Auto Loader?&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;</description>
      <pubDate>Tue, 13 Feb 2024 23:27:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60107#M31590</guid>
      <dc:creator>Gilg</dc:creator>
      <dc:date>2024-02-13T23:27:53Z</dc:date>
    </item>
    <item>
      <title>Re: Multiple Autoloader reading the same directory path</title>
      <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60121#M31593</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11184"&gt;@Gilg&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for bringing up your concern. Running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even with different catalogs, will not create file locks. Here is why I think so:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Each pipeline's Auto Loader keeps its own checkpoint, so each pipeline tracks and processes the files in the directory independently.&lt;/LI&gt;&lt;LI&gt;The underlying cloud object storage is designed for concurrent reads.&lt;/LI&gt;&lt;LI&gt;In continuous mode, each pipeline picks up new files as they appear in the source directory.&lt;/LI&gt;&lt;LI&gt;Each pipeline instance acts independently, meaning they don't coordinate or interfere with each other's reading process.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Wed, 14 Feb 2024 06:18:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60121#M31593</guid>
      <dc:creator>Palash01</dc:creator>
      <dc:date>2024-02-14T06:18:49Z</dc:date>
    </item>
    <item>
      <title>Re: Multiple Autoloader reading the same directory path</title>
      <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60261#M31622</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The files that I am reading come from a Service Bus. Each file contains only 1 record in JSON format, and the files range in size from bytes to KB.&lt;BR /&gt;&amp;nbsp;&lt;BR /&gt;The issue I am seeing is that Auto Loader seems to sit idle for a long time (1.5h) before it writes the data to bronze. I was also wondering whether this is because Auto Loader's maxFilesPerTrigger defaults to 1000 files for each micro-batch. It seems like Auto Loader is waiting to meet that criterion before it triggers the micro-batch.&lt;/P&gt;&lt;P&gt;Also, one thing I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So maybe the majority of the time is spent on listing the directory and maintaining the checkpoint. If so, is there a way to reduce this overhead?&lt;/P&gt;&lt;P&gt;Lastly, when the micro-batch process is done, the records seem up to date.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Feb 2024 20:43:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/60261#M31622</guid>
      <dc:creator>Gilg</dc:creator>
      <dc:date>2024-02-14T20:43:30Z</dc:date>
    </item>
    <item>
      <title>Re: Multiple Autoloader reading the same directory path</title>
      <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/61627#M31816</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We are receiving around 6k files every hour, or 99 files per minute, and these files vary in size.&lt;/P&gt;&lt;P&gt;One thing I also noticed is that the&amp;nbsp;&lt;SPAN&gt;Scheduler Delay seems to take too long, around 1 to 2 hours.&lt;BR /&gt;&lt;BR /&gt;We are already using ADLS Gen2, the Bronze tables are in Delta format, and we are not using any schema inference, so I am not sure what is going on in our DLT pipeline.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Feb 2024 01:27:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/61627#M31816</guid>
      <dc:creator>Gilg</dc:creator>
      <dc:date>2024-02-23T01:27:51Z</dc:date>
    </item>
    <item>
      <title>Re: Multiple Autoloader reading the same directory path</title>
      <link>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/101285#M40615</link>
      <description>&lt;P&gt;To answer the original question, Auto Loader does not use locks when reading files. You are, however, limited by the underlying storage system, ADLS in this example.&lt;/P&gt;
&lt;P&gt;Going by what has been mentioned (long batch times, but Spark jobs finish really fast), it sounds like you are limited by listing the directory. For high-volume setups where old files are not cleaned out of the source directories, we recommend using &lt;A href="https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode.html#what-is-auto-loader-file-notification-mode" target="_self"&gt;file notification mode&lt;/A&gt;. This works well because it avoids listing historical directories to find new files.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Dec 2024 20:57:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/multiple-autoloader-reading-the-same-directory-path/m-p/101285#M40615</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2024-12-06T20:57:21Z</dc:date>
    </item>
  </channel>
</rss>

