Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reprocessing the data with Auto Loader

Eldar_Dragomir
New Contributor II

Could you please give me an idea of how to start reprocessing my data?
Imagine I have a folder "/test" in ADLS Gen2 containing binary files. They have already been processed by the current pipeline.
I want to reprocess that data and also keep receiving new data.
What settings do I need for that?
Do I need two "loads", or can I use one with Trigger.AvailableNow and a limit on files per batch?

1 ACCEPTED SOLUTION


Tharun-Kumar
Databricks Employee

@Eldar_Dragomir 

To re-process the data, we have to change the checkpoint directory. This will start processing the files from the beginning. You can use cloudFiles.maxFilesPerTrigger to limit the number of files processed per micro-batch and keep the pipeline stable.
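
A minimal sketch of what this could look like in PySpark (the ADLS paths, checkpoint location, and target table name below are placeholders, not from your pipeline; `spark` is the SparkSession a Databricks notebook provides):

```python
# Re-process "/test" with Auto Loader while continuing to pick up new files.
# A NEW checkpoint location discards Auto Loader's record of already-ingested
# files, so the directory is listed and processed from the beginning.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    # Cap the number of files per micro-batch so the backfill stays stable.
    .option("cloudFiles.maxFilesPerTrigger", 100)
    .load("abfss://container@account.dfs.core.windows.net/test")  # placeholder path
)

query = (
    df.writeStream
    # Fresh checkpoint directory => start over from the first file.
    .option("checkpointLocation",
            "abfss://container@account.dfs.core.windows.net/_checkpoints/test_v2")
    # AvailableNow drains the backlog in rate-limited micro-batches, then stops;
    # re-running the same query later ingests only files that arrived since.
    .trigger(availableNow=True)
    .toTable("main.bronze.test_binary")  # placeholder table
)
```

On your single-load question: Trigger.AvailableNow respects rate limits such as cloudFiles.maxFilesPerTrigger, so one stream should be able to handle both the backfill and, on later runs, only the newly arrived files; a second load isn't required.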


