Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Live Table Path/Directory help

standup1
Contributor

Hello, I am working on a DLT pipeline and I've been facing an issue. I hope someone here can help me find a solution.
My files are JSON in Azure storage. These files are stored in a directory structure like this (blobName/FolderName/xx.json).
The folder name looks like this (2024-08-20T12.00.00Z). We receive these files all day, and the folder name changes. The problem I have is that when I run the pipeline, I need to exclude today's files.

Here's part of my script

 .load(blobName + "/" + FolderName + "/*.json"). So if I run it today, I will do .load(blobName + "/" + "2024-09-11*Z" + "/*.json"). This works fine and picks up all files from 2024-09-11. The problem is the following day. When I change it to .load(blobName + "/" + "2024-09-12*Z" + "/*.json"), it doesn't pick up all the files. I think it only picks up the new files that arrived after yesterday's run time. Is there anything I am doing wrong with those "*" wildcards in my path? I appreciate any help.
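For what it's worth, the wildcard pattern itself can be sanity-checked outside Spark with Python's `fnmatch`, which uses the same glob-style `*` semantics as the path patterns above (the folder names below are hypothetical examples, not your actual data):

```python
from fnmatch import fnmatch

# Hypothetical folder names in the container, one per ingestion batch
folders = [
    "2024-09-11T08.00.00Z",
    "2024-09-11T20.30.00Z",
    "2024-09-12T09.15.00Z",
]

# The pattern used in .load(...) for September 11th
pattern = "2024-09-11*Z"

# Both 2024-09-11 folders match, so the glob itself is not the problem
matched = [f for f in folders if fnmatch(f, pattern)]
print(matched)  # ['2024-09-11T08.00.00Z', '2024-09-11T20.30.00Z']
```

If the pattern matches the expected folders here, the missing files are more likely explained by something other than the wildcard, as the replies below suggest.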

7 REPLIES

filipniziol
Contributor III

Hi @standup1 ,

Auto Loader uses a checkpointing mechanism to track files that have been processed. When you run your DLT pipeline, Auto Loader remembers which files were processed during previous runs and will not reprocess them.
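The effect can be sketched with a toy model of the checkpoint (pure Python, all names invented for illustration, not Auto Loader's real internals): files already recorded in the checkpoint are skipped on the next run even if the glob matches them again.

```python
def discover_new_files(listed_files, checkpoint):
    """Toy model of Auto Loader's file tracking: return only files
    not seen in a previous run, and record them in the checkpoint."""
    new_files = [f for f in listed_files if f not in checkpoint]
    checkpoint.update(new_files)
    return new_files

checkpoint = set()

# Run 1: the glob matches two files; both are new, so both are processed.
run1 = discover_new_files(["a.json", "b.json"], checkpoint)

# Run 2: the same glob matches three files, but only the new one is processed.
run2 = discover_new_files(["a.json", "b.json", "c.json"], checkpoint)

print(run1, run2)  # ['a.json', 'b.json'] ['c.json']
```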

Do you really want to reprocess the existing files from previous days?


standup1
Contributor

Thanks for your reply. That makes a lot of sense. So it looks like DLT will scan those files but won't load them into the DataFrame, and when I run it the next day to pick up the previous day's data, it only brings in some of them. I need to reprocess those files, but only from yesterday. Do you know if there's any workaround? I tried option("cloudFiles.includeExistingFiles", "true") but that didn't do it.

filipniziol
Contributor III

option("cloudFiles.includeExistingFiles", "true") will not work.

This option controls whether existing files in the folder are processed when Auto Loader is first started, so it only matters on the first run of the stream, and it defaults to true.

Could you once again explain what you are trying to achieve?

Why do you want to reprocess the already processed files?

Do I understand correctly, that the requirement is to exclude the files from today?

In the code you shared, you say you are doing .load(blobName + "/" + FolderName + "/*.json"), so it should pick up all the files that were not processed previously.

standup1
Contributor

Do I understand correctly, that the requirement is to exclude the files from today? Yes, exactly.

In the code you shared, you say you are doing .load(blobName + "/" + FolderName + "/*.json"), so it should pick up all the files that were not processed previously. This is the problem: it doesn't process all the files.

So, what I am trying to process are the files that were excluded from yesterday's run. For example, today (September 11th), the pipeline should process all files up to yesterday, excluding today's files since they're incomplete. Tomorrow (September 12th), the pipeline should process all files up to yesterday, including the files that were excluded yesterday, but still exclude today's files. In essence, the pipeline should always exclude files from the current run day but can process any files prior to that day.

filipniziol
Contributor III

Hi @standup1 ,

To do so, you need to compute today_date, add a file_date column to the DataFrame, and then keep only the records where file_date is less than today_date:

My folders:

[image: filipniziol_0-1726124068229.png]

The code to get the files, excluding today's folders, based on the folder name:

[image: filipniziol_1-1726124195128.png]

Output:

[image: filipniziol_2-1726124354828.png]

Hope it helps
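Since the screenshots don't survive here, below is a minimal sketch of the idea in plain Python, assuming folder names like 2024-08-20T12.00.00Z as described in the question. The function names are illustrative; in a real DLT pipeline you would derive the date from each input file's path inside the Spark query (for example from the file-path metadata column) and filter on it there, rather than pre-filtering a Python list.

```python
from datetime import datetime, date

def folder_date(folder_name: str) -> date:
    """Parse a folder name like '2024-08-20T12.00.00Z' into a date."""
    return datetime.strptime(folder_name, "%Y-%m-%dT%H.%M.%SZ").date()

def exclude_today(folders, today: date):
    """Keep only folders whose date is strictly before 'today'."""
    return [f for f in folders if folder_date(f) < today]

folders = [
    "2024-09-10T23.50.00Z",
    "2024-09-11T08.00.00Z",
    "2024-09-11T20.30.00Z",
]

# Running on 2024-09-11: only the 2024-09-10 folder qualifies.
print(exclude_today(folders, date(2024, 9, 11)))  # ['2024-09-10T23.50.00Z']
```

The key point is the strict file_date < today_date comparison: files from the current day are always skipped, and they become eligible automatically on the next day's run.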


standup1
Contributor

Hi @filipniziol ,

Thank you so much for sharing this example with me. This is very helpful. I appreciate your help.

filipniziol
Contributor III

Hi @standup1 , I'm glad the example was helpful.
