Multiple Autoloader reading the same directory path

Gilg — Tue, 13 Feb 2024 23:27:53 GMT

Originally, I only had 1 pipeline looking at a directory. Now, as a test, I cloned the existing pipeline and edited the settings to use a different catalog. Both pipelines are now reading the same directory path and running in continuous mode.

Question.

Does this create file locks when pipeline 1 reads these files using Autoloader?

Cheers,

Re: Multiple Autoloader reading the same directory path

Palash01 — Wed, 14 Feb 2024 06:18:49 GMT

Hey @Gilg

Thanks for bringing up your concern. Running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even with different catalogs, will not create file locks. Here is why I think so:

Each pipeline's Autoloader stream keeps its own checkpoint, so each one tracks independently which files it has already processed, and both pipelines ingest every file in the directory.
The files sit in cloud object storage, which is designed for concurrent reads.
In continuous mode, each pipeline keeps running and picks up new files as they arrive in the source directory.
Each pipeline instance acts independently, meaning they don't coordinate or interfere with each other's reading process.
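
To make the independence concrete, here is a minimal sketch using plain Structured Streaming rather than DLT (DLT manages checkpoints for you, so you would not set them yourself there). The source path, checkpoint paths, and table names are hypothetical placeholders, and `start_stream` needs a Databricks runtime to actually run. The point is only that both streams share a source path while each keeps its own checkpoint.

```python
def stream_settings(pipeline_name: str, source_path: str) -> dict:
    """Settings for one stream: shared source, private checkpoint and target."""
    return {
        "source_path": source_path,
        # Hypothetical locations; each stream must get its own checkpoint.
        "checkpoint_path": f"/tmp/checkpoints/{pipeline_name}",
        "target_table": f"{pipeline_name}_bronze",
    }


def start_stream(spark, schema, settings: dict):
    """Start one Autoloader stream (requires a Databricks runtime)."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(schema)
        .load(settings["source_path"])
        .writeStream
        # File-tracking state lives here, so the two streams never share it.
        .option("checkpointLocation", settings["checkpoint_path"])
        .toTable(settings["target_table"])
    )


# Two streams over the same (hypothetical) directory.
pipeline_1 = stream_settings("pipeline_1", "/mnt/landing/events")
pipeline_2 = stream_settings("pipeline_2", "/mnt/landing/events")
```

Because neither stream writes anything back to the source directory, each one only reads the files and records its progress in its own checkpoint.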

Re: Multiple Autoloader reading the same directory path

Gilg — Wed, 14 Feb 2024 20:43:30 GMT

Thanks @Retired_mod

The files that I am reading are from a Service Bus. Each file contains only 1 record in JSON format, and the files vary in size from bytes to KB.

The issue that I am getting is that Autoloader seems to be idle for a long time (1.5h) before it writes the data to bronze. I was thinking this could be because Autoloader's maxFilesPerTrigger defaults to 1000 files per micro-batch, and it seems like Autoloader is waiting to meet that criterion before it triggers the micro-batch.

Also, one thing that I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So maybe the majority of the time is spent on listing the directory and maintaining the checkpoint. If so, is there a way to reduce this?

Lastly, when the micro-batch process is done, the records seem up to date.

Re: Multiple Autoloader reading the same directory path

Gilg — Fri, 23 Feb 2024 01:27:51 GMT

Hi @Retired_mod

We are receiving around 6k files every hour, or about 99 files per minute, and these files vary in size.

One thing I also noticed is that the Scheduler Delay seems to be very long, from 1 hr up to 2 hrs.

We are already using ADLS Gen2, the Bronze tables are in Delta format, and we are not using any schema inference, so I am not sure what is going on in our DLT pipeline.

Re: Multiple Autoloader reading the same directory path

cgrant — Fri, 06 Dec 2024 20:57:21 GMT

To answer the original question, Autoloader does not use locks when reading files. You are, however, limited by the underlying storage system, which is ADLS in this example.

Going by what has been mentioned (long batch times, but Spark jobs finish really fast), it sounds like you are limited by listing the directory. For high-volume setups where source directories are not cleaned of old files, we recommend using file notification mode. This works well because it avoids listing historical directories to find new files.
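
As a rough sketch of what enabling file notification mode could look like, assuming a Python pipeline: `cloudFiles.useNotifications` and `cloudFiles.maxFilesPerTrigger` are Autoloader options, while the function names and the `source_path` and `schema` arguments are placeholders for illustration. On Azure, notification mode also needs credentials or an existing queue to be configured; those settings are omitted here. In a DLT pipeline, the DataFrame returned by `read_source` would be returned from a function decorated with `@dlt.table`.

```python
def autoloader_options(use_notifications: bool = True,
                       max_files_per_trigger: int = 1000) -> dict:
    """Build Autoloader reader options as strings."""
    return {
        "cloudFiles.format": "json",
        # "true": discover new files from storage events instead of
        # listing the whole directory on every micro-batch.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
        # Upper bound on new files per micro-batch, not a minimum to wait for.
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
    }


def read_source(spark, source_path, schema):
    """Return a streaming DataFrame over source_path (requires a Databricks runtime)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .schema(schema)
        .load(source_path)
    )
```

Note that `cloudFiles.maxFilesPerTrigger` only caps the batch size, so a batch with fewer than 1000 new files is still processed as soon as those files have been discovered.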