Data Engineering
How to manage data reload in DLT

harvey-c
New Contributor III

Hi, Community members

I have a situation where I need to reload some data via a DLT pipeline. All the data is stored in a landing storage account and has been loaded on a daily basis, for example from 1 Nov to 30 Nov.

For some reason, I need to reload the data of 25 Nov, and I tried to use the following options to force the data reload:

  .option("cloudFiles.includeExistingFiles", includeExistingFiles)
        .option("modifiedBefore",modifiedBefore)
        .option("modifiedAfter",modifiedAfter)

 

However, no data was loaded or reloaded, even after I deleted that day's data from the bronze table. I guess it might be because the checkpoint does not allow me to reload the data. I ended up reloading the data into a new schema, which is not the desired outcome.

Could you please advise how I should manage the data reload scenario? 

Thank you!

 

 

1 REPLY

Kaniz
Community Manager

Hi @harvey-c, certainly! Databricks Auto Loader provides several configuration options for efficiently ingesting data from a cloud storage path, such as an S3 bucket or an ADLS Gen2 container.

 

Let’s focus on the options most relevant to controlling which files are picked up (a short sketch putting them together follows the list):

 

cloudFiles.allowOverwrites: This boolean option controls whether input directory file changes can overwrite existing data. It’s available in Databricks Runtime 7.6 and above. By default, it’s set to false.

cloudFiles.backfillInterval: In asynchronous backfill mode, Auto Loader triggers backfills at a specified interval (e.g., once a day or once a week). Backfills help ensure that all files eventually get processed, especially when file event notification systems don’t guarantee 100% delivery of uploaded files. Available in Databricks Runtime 8.4 and above (unsupported).

cloudFiles.includeExistingFiles: This boolean option determines whether to include existing files in the stream processing input path or only process new files arriving after initial setup. It’s evaluated only when you start a stream for the first time and has no effect after restarting the stream. By default, it’s set to true.

cloudFiles.inferColumnTypes: When leveraging schema inference, this boolean option controls whether to infer exact column types. By default, columns are inferred as strings when inferring JSON and CSV datasets. Set it to true if you want precise column type inference.

cloudFiles.maxBytesPerTrigger: Specify the maximum number of new bytes to process in each trigger. For example, you can limit each microbatch to 10 GB of data. This is a soft maximum, and Databricks processes up to the lower limit of either cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger, whichever is reached first.
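
To show how these options fit together, here is a minimal sketch of an Auto Loader source inside a DLT pipeline. The table name, source file format, and landing path are placeholder assumptions for illustration, not taken from your pipeline:

import dlt  # available inside a Delta Live Tables pipeline; `spark` is pre-defined there

@dlt.table(name="bronze_daily")  # hypothetical table name
def bronze_daily():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")                # source file format (assumed)
        .option("cloudFiles.includeExistingFiles", "true")  # honored only on the first start of the stream
        .option("cloudFiles.allowOverwrites", "true")       # pick up files rewritten in place
        .option("cloudFiles.inferColumnTypes", "true")      # infer exact types instead of strings
        .option("cloudFiles.backfillInterval", "1 day")     # periodic backfill sweep of the directory
        .option("cloudFiles.maxBytesPerTrigger", "10g")     # soft cap on data per micro-batch
        .load("abfss://landing@<storage-account>.dfs.core.windows.net/daily/")  # placeholder landing path
    )

Keep in mind that cloudFiles.includeExistingFiles is only evaluated when the stream starts against a fresh checkpoint, so changing it on an existing stream will not by itself reload days that were already processed.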

 

Remember that Auto Loader’s directory listing mode allows you to quickly start streams without additional permission configurations beyond access to your data on cloud storage.

 

Happy data ingestion! 🚀

 

For more details, you can refer to the official Databricks documentation.
