
Reprocessing the data with Auto Loader

Eldar_Dragomir
New Contributor II

Could you please give me an idea of how to start reprocessing my data?
Imagine I have a folder in ADLS Gen2, "/test", containing binaryFiles. They have already been processed by the current pipeline.
I want to reprocess that data and continue receiving new data.
What settings do I need for that?
Do I need two loads, or can I use a single one with Trigger.AvailableNow and a limit on files per batch?

1 ACCEPTED SOLUTION


Tharun-Kumar
Honored Contributor II

@Eldar_Dragomir 

To reprocess the data, point the stream at a new checkpoint directory. Auto Loader tracks ingested files in the checkpoint, so a fresh checkpoint makes it discover and process every file in the path from the beginning, and it will keep picking up new files afterwards. You can set cloudFiles.maxFilesPerTrigger to limit the number of files processed per micro-batch and keep the pipeline stable.
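A minimal sketch of what that could look like in PySpark, assuming a binaryFile source at "/test" as in the question; the checkpoint path and target table name are hypothetical placeholders, and spark is the session Databricks provides in notebooks:

```python
# Read with Auto Loader; the rate limit keeps each micro-batch small.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .option("cloudFiles.maxFilesPerTrigger", 100)   # files per micro-batch
      .load("/test"))

# Writing against a NEW checkpoint location makes Auto Loader forget what it
# has already ingested, so every existing file in /test is processed again.
(df.writeStream
   .option("checkpointLocation", "/checkpoints/test_v2")  # hypothetical new path
   .trigger(availableNow=True)  # drain the backlog in bounded batches, then stop
   .toTable("bronze_test"))     # hypothetical target table
```

Trigger.AvailableNow honors the maxFilesPerTrigger rate limit, so a single load covers both cases: the first run reprocesses everything in bounded micro-batches, and later runs against the same (new) checkpoint pick up only files that arrived since.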

