Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to exclude/skip a file temporarily in DLT

standup1
Contributor

Hi,

Is there any way to temporarily exclude a file from a DLT pipeline (Auto Loader) run? What I mean is that I want to be able to exclude a specific file until I decide to include it in the load. I can't control the files or the location where they are stored.

Assume we have the following:

folder/file1

folder/file2

folder/file3

So I want to run the DLT pipeline only for file1 and file2, and include file3 at some later point. I can exclude the file using filters or options like modifiedBefore, etc. (see the sketch below), and right now the DLT pipeline runs and loads file1 and file2 successfully. However, when I remove my exclusion criteria from the script, the DLT pipeline still doesn't pick up the excluded file. I think the pipeline registers it somewhere or keeps track of all the files it has seen, so it doesn't consider it a new file.
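For context, the temporary exclusion I'm applying looks roughly like this (a minimal sketch; the source path, file format, and table name are placeholders for my actual setup, and the glob filter just stands in for whatever criterion is used, such as modifiedBefore):

```python
import dlt

# Minimal sketch of the DLT table with a temporary exclusion filter.
# Path, format, and pattern below are placeholders for the real setup.
@dlt.table(name="raw_files")
def raw_files():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")      # actual file format differs
        .option("pathGlobFilter", "file[12]*")    # temporarily skip file3
        .load("/mnt/landing/folder/")             # placeholder source path
    )
```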

Please let me know if I need to clarify anything further.

I appreciate any help or advice.

2 REPLIES

brockb
Valued Contributor

Hi,

I'm not aware of default Auto Loader functionality that does what you're looking for, given that Auto Loader is designed to incrementally ingest data as it arrives in cloud storage. Can you describe more about "...exclude a specific file until I decide to include it in the load"? How do you know when to include it in a load?

Perhaps you should consider the Databricks Workflows "File Arrival" trigger (https://docs.databricks.com/en/workflows/jobs/file-arrival-triggers.html#trigger-jobs-when-new-files...). Maybe this could be used to trigger a job run, make a decision on what action to take (i.e. "until I decided to include it in the load"), and perhaps even copy the file to an alternate location once that decision is made, with the DLT Auto Loader stream watching that new, copied location?
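To make that concrete, here is a very rough sketch of what such a triggered task might do (all paths are illustrative, and is_ready() is a hypothetical placeholder for whatever rule tells you a file is complete):

```python
# Rough sketch of a notebook task run by a File Arrival trigger.
# landing_dir / staging_dir are illustrative paths; the DLT Auto Loader
# stream would watch staging_dir instead of the original landing folder.
landing_dir = "/mnt/landing/folder/"
staging_dir = "/mnt/staging/folder/"   # assumed to already exist

already_staged = {f.name for f in dbutils.fs.ls(staging_dir)}

for f in dbutils.fs.ls(landing_dir):
    # is_ready() is a hypothetical check for "this file is safe to load"
    # (e.g. a newer file exists, or the file hasn't grown recently).
    if f.name not in already_staged and is_ready(f):
        dbutils.fs.cp(f.path, staging_dir + f.name)
```

That keeps the decision about when a file becomes loadable outside of DLT, so the pipeline itself only ever sees files you have already approved.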

Hope it's helpful.

Hi @brockb ,

Thank you for your reply.

Sure, I will elaborate. Assume we have 3 files (file1, file2, and file3). file3 is locked by the system while it is still writing data to it. The DLT pipeline throws an error because it can't read file3 (because it is locked by the system, at least as far as we can tell for now). So I want to be able to exclude file3, or find an alternative way to keep the DLT pipeline from crashing and have it try to read the file later once the system has unlocked it. When I exclude that file (file3), or wait until it's done (unlocked), the pipeline runs fine. We know a file is done when we see a newer file appear; in this case, once we see file4 we know file3 is complete.
Here's the error if you are curious. 

[Screenshot of the error attached: standup1_0-1716233087018.png]
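Building on the staging idea, the workaround we are leaning toward is to copy everything except the most recently modified file into the folder DLT watches, since in our case a file is only complete once a newer one shows up. Very roughly (paths are placeholders, and I haven't verified this end to end yet):

```python
# Rough sketch (placeholder paths): stage every file except the newest one,
# which we treat as still locked / being written by the source system.
landing_dir = "/mnt/landing/folder/"
staging_dir = "/mnt/staging/folder/"   # the location the DLT pipeline reads

files = dbutils.fs.ls(landing_dir)
# modificationTime is available on FileInfo in recent runtimes; otherwise we
# could sort by name, since files arrive as file1, file2, file3, ...
newest = max(files, key=lambda f: f.modificationTime)

already_staged = {f.name for f in dbutils.fs.ls(staging_dir)}
for f in files:
    if f.name != newest.name and f.name not in already_staged:
        dbutils.fs.cp(f.path, staging_dir + f.name)
```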

 
