02-13-2024 03:27 PM
Hi
Originally, I had only one pipeline pointing at a directory. As a test, I cloned the existing pipeline and edited its settings to target a different catalog. Now both pipelines are essentially reading the same directory path and running in continuous mode.
Question.
Does this create file locks when pipeline 1 reads these files using Autoloader?
Cheers,
02-13-2024 10:18 PM
Hey @Gilg
Thanks for bringing up your concern. Let's delve into this: running two Delta Live Tables pipelines that read from the same directory path in continuous mode, even with different catalogs, will not create file locks.
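To illustrate the idea, here is a minimal sketch using plain Structured Streaming rather than DLT (in DLT the checkpoints are managed per pipeline automatically). The path, schema, checkpoint locations, and table names are hypothetical placeholders. Each stream tracks the files it has discovered in its own checkpoint, so the two readers never share state and never take locks on the source files:

```python
# Hypothetical sketch: two independent Auto Loader streams over the same source path.
# Each stream keeps its own checkpoint, so file-discovery state is never shared
# and no locks are taken on the source files themselves.
source_path = "abfss://container@account.dfs.core.windows.net/landing/"  # placeholder

def make_stream():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema("payload STRING")  # placeholder schema; no inference
        .load(source_path)
    )

# Separate checkpoints -> fully independent progress tracking per pipeline.
make_stream().writeStream \
    .option("checkpointLocation", "/chk/pipeline_1") \
    .toTable("catalog_1.bronze.events")

make_stream().writeStream \
    .option("checkpointLocation", "/chk/pipeline_2") \
    .toTable("catalog_2.bronze.events")
```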
02-14-2024 12:33 PM - edited 02-14-2024 12:43 PM
Thanks @Retired_mod
The files I am reading come from a Service Bus. Each file contains a single record in JSON format, and file sizes vary from bytes to kilobytes.
The issue I am seeing is that Auto Loader appears to sit idle for a long time (about 1.5 hours) before it writes the data to bronze. I was also thinking that, because Auto Loader's maxFilesPerTrigger defaults to 1000 files per micro-batch, it might be waiting to meet that threshold before triggering the micro-batch.
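For reference, a minimal sketch of setting this option explicitly (the path and schema are placeholders). Note that cloudFiles.maxFilesPerTrigger is documented as an upper bound per micro-batch, not a minimum that Auto Loader waits to accumulate:

```python
# Hedged sketch: lowering cloudFiles.maxFilesPerTrigger so micro-batches
# process fewer files each. This option caps the batch size; it does not
# make Auto Loader wait until that many files have arrived.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 100)  # default is 1000
    .schema("payload STRING")  # placeholder schema; no inference
    .load("abfss://container@account.dfs.core.windows.net/landing/")  # placeholder path
)
```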
Also, one thing I noticed when looking at the Spark UI is that jobs/stages finish within seconds. So perhaps the majority of the time is spent listing the directory and maintaining the checkpoint. If so, is there a way to reduce this overhead?
Lastly, when the micro-batch process is done, the records seem up to date.
02-22-2024 05:27 PM
Hi @Retired_mod
We are receiving around 6,000 files every hour, or about 99 files per minute, and these files vary in size.
One thing I also noticed is that the Scheduler Delay seems far too long, around 1 to 2 hours.
We are already using ADLS Gen2, the bronze tables are in Delta format, and we are not using any schema inference, so I am not sure what is going on in our DLT pipeline.
yesterday - last edited yesterday
To answer the original question, Auto Loader does not take file locks when reading files. You are, however, limited by the underlying storage system, ADLS in this example.
Going by what has been mentioned (long batch times, but Spark jobs finishing in seconds), it sounds like you are limited by directory listing. For high-volume setups where old files are not cleaned out of the source directories, we recommend file notification mode: it works well because it avoids listing historical directories to find new files.
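A minimal sketch of enabling file notification mode on Azure (the credential values, schema, and path below are placeholders; the service principal also needs permission to create the Event Grid subscription and storage queue that Auto Loader provisions on your behalf):

```python
# Hedged sketch: file notification mode discovers new files from storage events
# instead of listing the directory, avoiding repeated scans of historical paths.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-secret>")
    .schema("payload STRING")  # placeholder schema
    .load("abfss://container@account.dfs.core.windows.net/landing/")  # placeholder path
)
```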