Multiple Autoloaders reading the same directory path

Gilg
Contributor II

Hi

Originally, I only had one pipeline pointing at a directory. Now, as a test, I have cloned the existing pipeline and edited the settings to use a different catalog. Both pipelines are now basically reading the same directory path and running in continuous mode.

Question.

Does this create file locks when pipeline 1 reads these files using Autoloader?

Cheers,

 

5 REPLIES

Palash01
Contributor III

Hey @Gilg 

Thanks for bringing up your concern. Let's delve into this: running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even into different catalogs, will not create file locks. Here is why I think so (a minimal sketch follows the list):

  1. Each pipeline's Auto Loader stream keeps its own read cursor and checkpoint, so the two streams track their progress through the directory independently.
  2. The data sits in a storage layer built on top of Lake File Store (LFS), which is optimized for concurrent reads.
  3. Continuous mode simply triggers the pipeline whenever new files appear in the source directory.
  4. Each pipeline instance acts independently, meaning they don't coordinate with or interfere with each other's reading process.
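
For illustration, here is a minimal sketch of the setup being described, written as plain Structured Streaming in a notebook (a DLT pipeline would declare tables and let DLT manage the checkpoints, but the idea is the same). All paths, catalog names, and table names are hypothetical placeholders, not the poster's configuration:

    # Two Auto Loader streams over the SAME source directory, each with its own
    # checkpoint, so neither takes file locks nor disturbs the other's progress.
    # All paths, catalogs, and table names below are hypothetical placeholders;
    # `spark` is the session already available in a Databricks notebook.
    source_path = "abfss://landing@storageacct.dfs.core.windows.net/events/"

    def start_bronze_stream(checkpoint_path, target_table):
        return (
            spark.readStream
            .format("cloudFiles")                                 # Auto Loader source
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", checkpoint_path)
            .load(source_path)
            .writeStream
            .option("checkpointLocation", checkpoint_path)        # independent state per stream
            .toTable(target_table)
        )

    # Original pipeline and its clone: same directory, different catalogs and checkpoints.
    q1 = start_bronze_stream("/checkpoints/events_catalog_a", "catalog_a.bronze.events")
    q2 = start_bronze_stream("/checkpoints/events_catalog_b", "catalog_b.bronze.events")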

 

Leave a like if this helps! Kudos,
Palash

Kaniz
Community Manager

Hi @Gilg, when multiple pipelines are simultaneously accessing the same directory path and using Auto Loader in continuous mode, it is important to consider file locking and data consistency carefully.

 

Let's delve into the specifics of how Autoloader operates in this scenario. 

 

Autoloader is designed to efficiently transfer data from various file types, such as Parquet files, into Delta tables. As it discovers these files, their corresponding metadata is stored in a highly scalable key-value store, specifically RocksDB, located within the checkpoint location of your Autoloader pipeline. This key-value store guarantees that data is only processed once, ensuring accurate and reliable results.
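
If you want to see what that checkpoint state contains, recent Databricks runtimes expose it through the cloud_files_state table-valued function; a quick look, assuming a hypothetical checkpoint path:

    # Inspect the files Auto Loader has recorded in this stream's checkpoint.
    # The checkpoint path below is a hypothetical placeholder.
    files_seen = spark.sql(
        "SELECT * FROM cloud_files_state('/checkpoints/events_catalog_a')"
    )
    files_seen.show(truncate=False)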

 

How file writes behave depends heavily on the method the source system uses to upload files to the directory. Some methods start by creating an empty file and then gradually append chunks of data, while others first create a temporary file and then use an atomic operation to rename it. In situations where files are still being written to the directory, Auto Loader might come across incomplete files or files that have not yet fully appeared; one way a writer can make files appear atomically is sketched just below.
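
For illustration, here is one upload pattern that makes files appear atomically, sketched against a local POSIX filesystem (object stores such as ADLS Gen2 or S3 behave differently, so treat this as the pattern rather than a drop-in uploader); the directory names are hypothetical:

    import json
    import os

    def upload_atomically(record, staging_dir, landing_dir, name):
        """Write to a staging directory that Auto Loader does not watch, then
        rename into the landing directory so the file only ever appears there
        fully written (rename is atomic on a single POSIX filesystem)."""
        tmp_path = os.path.join(staging_dir, name)
        with open(tmp_path, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_path, os.path.join(landing_dir, name))  # atomic move into the landing path

    # Example (hypothetical paths; both directories must be on the same filesystem):
    # upload_atomically({"event_id": "42"}, "/data/staging", "/data/landing", "evt-42.json")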

 

The real risk when multiple pipelines read from the same directory is not locking but data loss or partial reads: a pipeline may pick up a file that is still being written. Auto Loader has no built-in safeguard against this, so whether it happens depends on how files are uploaded and whether those uploads are atomic. To avoid such issues, it is important to be deliberate about the protocols and processes used to write files into the directory.


 


 

Gilg
Contributor II

Thanks @Kaniz 

The files that I am reading come from a Service Bus. Each file contains a single record in JSON format, and the files vary in size from a few bytes to a few KB.
 
The issue that I am getting is that Auto Loader seems to sit idle for a long time (1.5h) before it writes the data into bronze. I was also thinking that, because Auto Loader's maxFilesPerTrigger defaults to 1000 files per micro-batch, it might be waiting to meet that criterion before it triggers the micro-batch.

Also, one thing that I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So maybe the majority of the time is spent on listing the directory and maintaining the checkpoint. If so, is there a way to reduce this overhead?

Lastly, when the micro-batch process is done, the records seem up to date. 

Kaniz
Community Manager

Hi @Gilg, it's great to see you working with data ingestion using Auto Loader in Databricks.

 

Let's address your observations and explore potential solutions:

 

Idle Time and Max Files Per Trigger:

  • You're correct that, by default, Auto Loader processes up to 1000 files per trigger (the cloudFiles.maxFilesPerTrigger option).
  • If your files arrive sporadically or in smaller quantities, this behaviour might lead to idle time.
  • To address this, consider lowering cloudFiles.maxFilesPerTrigger to a value that aligns with your data arrival frequency. For example, if files arrive every 10 minutes, you could set it to a smaller number (e.g., 100) to trigger more frequent micro-batches; a sketch follows this list.
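
A hedged sketch of what that adjustment could look like on a plain Structured Streaming read (in a DLT pipeline the same cloudFiles options are passed to the Auto Loader source); the values, paths, and table name are illustrative only:

    # Smaller, more frequent micro-batches: cap the number of files per trigger.
    # Values, paths, and table names here are illustrative placeholders.
    query = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.maxFilesPerTrigger", 100)        # default is 1000
        .option("cloudFiles.schemaLocation", "/checkpoints/bronze_events")
        .load("abfss://landing@storageacct.dfs.core.windows.net/events/")
        .writeStream
        .option("checkpointLocation", "/checkpoints/bronze_events")
        .toTable("catalog_a.bronze.events")
    )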

Listing Directory and Checkpoint Overhead:

  • The quick job/stage execution you observed suggests that actual data processing is efficient.
  • However, listing the directory and maintaining checkpoints can introduce overhead.
  • To mitigate this:
    • Checkpoint Optimization: Ensure that your checkpoint directory is on a performant storage system (e.g., DBFS, Azure Blob Storage, or S3). Also, consider checkpointing less frequently if possible.
    • Delta Lake: If you're using Delta Lake, it provides automatic checkpoint management and optimizations. Consider converting your bronze table to Delta format if it's not already.
    • Schema Inference: If your schema is stable, consider explicitly specifying the schema during read (e.g., using .schema(your_schema)); this avoids schema inference overhead. A sketch follows this list.
    • Schema Evolution: Auto Loader can automatically detect schema changes. If new columns are introduced, it evolves the table schema without manual intervention.
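
On the schema point, a minimal sketch of supplying an explicit schema to the Auto Loader read so no inference pass is needed; the column names are hypothetical, not the poster's actual message schema:

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Hypothetical fixed schema for the incoming JSON messages.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    bronze_df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(event_schema)     # explicit schema: no inference pass, no schemaLocation needed
        .load("abfss://landing@storageacct.dfs.core.windows.net/events/")
    )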

Up-to-Date Records:

  • It's great that your micro-batch process ensures up-to-date records.
  • Continue monitoring and validating the data to ensure consistency.

Gilg
Contributor II

Hi @Kaniz 

We are receiving around 6k files every hour, or roughly 99 files per minute, and these files can vary in size.

One thing I also noticed is that the Scheduler Delay seems to take very long, around 1 hour up to 2 hours.

We are already using ADLS Gen2, the Bronze table is in Delta format, and we are not using any schema inference. So I'm not sure what is going on in our DLT pipeline.
