Multiple Autoloaders reading the same directory path

Gilg
Contributor II

Hi

Originally, I had only one pipeline reading from a directory. As a test, I cloned the existing pipeline and edited its settings to point to a different catalog. Now both pipelines are reading the same directory path and running in continuous mode.

Question.

Does this create file locks when pipeline 1 reads these files using Autoloader?

Cheers,

 

4 REPLIES

Palash01
Valued Contributor

Hey @Gilg 

Thanks for bringing up your concern. Let's delve into this: running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even with different catalogs, will not create file locks. Here is why I think so (see the sketch after the list):

  1. Each pipeline's Autoloader maintains its own checkpoint state, so each stream independently tracks which files it has already ingested.
  2. The underlying storage layer (cloud object storage such as ADLS) is designed for concurrent reads.
  3. Continuous mode simply triggers the pipeline whenever new files appear in the source directory.
  4. Each pipeline instance acts independently, meaning the two don't coordinate with or interfere in each other's reading process.
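Here is a minimal sketch in plain Structured Streaming terms (all paths, table names, and checkpoint locations are hypothetical; inside a DLT pipeline the checkpoint is managed per pipeline automatically, but the principle is the same):

```python
# Two independent Auto Loader streams over the same source directory.
# All paths and table names below are hypothetical examples.
source_path = "abfss://landing@mystorage.dfs.core.windows.net/events/"

stream_a = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(source_path))

stream_b = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load(source_path))

# Each stream records its own ingestion progress in its own checkpoint,
# so the two never contend for locks on the source files.
(stream_a.writeStream
    .option("checkpointLocation", "/checkpoints/pipeline_a")
    .toTable("catalog_a.bronze.events"))

(stream_b.writeStream
    .option("checkpointLocation", "/checkpoints/pipeline_b")
    .toTable("catalog_b.bronze.events"))
```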

 

Leave a like if this helps! Kudos,
Palash

Thanks @Retired_mod 

The files that I am reading come from a Service Bus. Each file contains a single record in JSON format, and the files vary in size from a few bytes to kilobytes.
 
The issue that I am getting is that Autoloader seems to sit idle for a long time (1.5 h) before it writes the data to bronze. I was also thinking this might be because Autoloader's maxFilesPerTrigger defaults to 1000 files per micro-batch; it seems as if Autoloader is waiting to meet that criterion before it triggers the micro-batch.
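For reference, the option can be set explicitly like this (a sketch; the value 100 and the path are illustrative):

```python
# Sketch: explicitly cap the number of files per micro-batch.
# cloudFiles.maxFilesPerTrigger is an upper bound (default 1000),
# not a threshold the stream waits to reach before triggering.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)
    .load("abfss://landing@mystorage.dfs.core.windows.net/events/"))
```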

Also, one thing that I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So maybe the majority of the time is spent listing the directory and maintaining the checkpoint. If so, is there a method to reduce this behavior?

Lastly, when the micro-batch process is done, the records seem up to date. 

Hi @Retired_mod 

We are receiving around 6,000 files every hour, or about 99 files per minute, and these files can vary in size.

One thing I also noticed is that the Scheduler Delay seems to take too long, from 1 hour up to 2 hours.

We are already using ADLS Gen2, the Bronze tables are in Delta format, and we are not using any schema inference. So I am not sure what is going on in our DLT pipeline.

cgrant
Databricks Employee

To answer the original question: Autoloader does not use locks when reading files. You are, however, limited by the underlying storage system, ADLS in this example.

Going by what has been mentioned (long batch times, but Spark jobs finishing in seconds), it sounds like you are limited by listing the directory. For high-volume setups where source directories are not cleaned up of old files, we recommend using file notification mode; this works well because it avoids listing historical directories to find new files.
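A minimal sketch of what enabling that looks like (the option is the documented cloudFiles.useNotifications; the path and table name are hypothetical). In notification mode on ADLS Gen2, Auto Loader subscribes to storage events rather than listing the directory:

```python
# Sketch: switch Auto Loader from directory listing to file
# notification mode. On ADLS Gen2 this provisions an Event Grid
# subscription plus a storage queue (needs suitable permissions).
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("abfss://landing@mystorage.dfs.core.windows.net/events/"))

(df.writeStream
    .option("checkpointLocation", "/checkpoints/bronze_events")
    .toTable("catalog.bronze.events"))
```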
