02-13-2024 03:27 PM
Hi
Originally, I had only one pipeline pointing at a directory. As a test, I cloned the existing pipeline and edited its settings to target a different catalog. Now both pipelines are essentially reading the same directory path and running in continuous mode.
Question.
Does this create file locks when pipeline 1 reads these files using Autoloader?
Cheers,
02-13-2024 10:18 PM
Hey @Gilg
Thanks for bringing up your concern. Let's delve into this: running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even with different catalogs, will not create file locks.
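To illustrate the idea, here is a minimal sketch using plain Structured Streaming rather than DLT (in DLT the checkpoints are managed per pipeline automatically). The path, schema, checkpoint locations, and table names are hypothetical placeholders. Each stream tracks the files it has discovered in its own checkpoint, so the two readers never share state and never take locks on the source files:

```python
# Hypothetical sketch: two independent Auto Loader streams over the same source path.
# Each stream keeps its own checkpoint, so file-discovery state is never shared
# and no locks are taken on the source files themselves.
source_path = "abfss://container@account.dfs.core.windows.net/landing/"  # placeholder

def make_stream():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema("payload STRING")  # placeholder schema; no inference
        .load(source_path)
    )

# Separate checkpoints -> fully independent progress tracking per pipeline.
make_stream().writeStream \
    .option("checkpointLocation", "/chk/pipeline_1") \
    .toTable("catalog_1.bronze.events")

make_stream().writeStream \
    .option("checkpointLocation", "/chk/pipeline_2") \
    .toTable("catalog_2.bronze.events")
```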
02-14-2024 12:33 PM - edited 02-14-2024 12:43 PM
Thanks @Retired_mod
The files I am reading come from a Service Bus. Each file contains a single record in JSON format, and file sizes vary from bytes to kilobytes.
The issue I am seeing is that Auto Loader appears to sit idle for a long time (about 1.5 hours) before it writes the data to bronze. I was also thinking that, because Auto Loader's maxFilesPerTrigger defaults to 1000 files per micro-batch, it might be waiting to meet that threshold before triggering the micro-batch.
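For reference, a minimal sketch of setting this option explicitly (the path and schema are placeholders). Note that cloudFiles.maxFilesPerTrigger is documented as an upper bound per micro-batch, not a minimum that Auto Loader waits to accumulate:

```python
# Hedged sketch: lowering cloudFiles.maxFilesPerTrigger so micro-batches
# process fewer files each. This option caps the batch size; it does not
# make Auto Loader wait until that many files have arrived.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)  # default is 1000
    .schema("payload STRING")  # placeholder schema; no inference
    .load("abfss://container@account.dfs.core.windows.net/landing/")  # placeholder path
)
```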
Also, one thing I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So perhaps the majority of the time is spent listing the directory and maintaining the checkpoint. If so, is there a way to reduce this overhead?
Lastly, when the micro-batch process is done, the records seem up to date.
02-22-2024 05:27 PM
Hi @Retired_mod
We are receiving around 6,000 files every hour, or about 99 files per minute, and these files vary in size.
One thing I also noticed is that the Scheduler Delay seems far too long, around 1 to 2 hours.
We are already using ADLS Gen2, the bronze tables are in Delta format, and we are not using any schema inference, so I am not sure what is going on in our DLT pipeline.
yesterday - last edited yesterday
To answer the original question, Auto Loader does not take file locks when reading files. You are, however, limited by the underlying storage system, ADLS in this example.
Going by what has been mentioned (long batch times, but Spark jobs finishing in seconds), it sounds like you are limited by directory listing. For high-volume setups where old files are not cleaned out of the source directories, we recommend file notification mode: it works well because it avoids listing historical directories to find new files.
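A minimal sketch of enabling file notification mode on Azure (the credential values, schema, and path below are placeholders; the service principal also needs permission to create the Event Grid subscription and storage queue that Auto Loader provisions on your behalf):

```python
# Hedged sketch: file notification mode discovers new files from storage events
# instead of listing the directory, avoiding repeated scans of historical paths.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .schema("payload STRING")  # placeholder schema
    .load("abfss://container@account.dfs.core.windows.net/landing/")  # placeholder path
)
```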