Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to manage data reload in DLT

harvey-c
New Contributor III

Hi Community members,

I ran into a situation where I need to reload some data via a DLT pipeline. All the data is stored in a landing storage account and has been loaded on a daily basis, for example from 1 Nov to 30 Nov.

For some reason I need to reload the data for 25 Nov, and I tried to use the following options to force the reload:

  .option("cloudFiles.includeExistingFiles", includeExistingFiles)
        .option("modifiedBefore",modifiedBefore)
        .option("modifiedAfter",modifiedAfter)

 

However, no data was loaded or reloaded, even after I deleted that day's data from the bronze table. I suspect the checkpoint is preventing the reload, and I ended up reloading the data into a new schema, which is not the desired outcome.

Could you please advise how I should manage the data reload scenario? 

Thank you!

1 REPLY

Kaniz_Fatma
Community Manager

Hi @harvey-c, Certainly! Databricks Auto Loader provides several configuration options for efficiently ingesting data from a cloud storage bucket or directory.

Let’s focus on the options most relevant to controlling which files get listed and processed (a combined sketch follows this list):
cloudFiles.allowOverwrites: This boolean option controls whether input directory file changes can overwrite existing data. It’s available in Databricks Runtime 7.6 and above. By default, it’s set to false.

cloudFiles.backfillInterval: In asynchronous backfill mode, Auto Loader triggers backfills at a specified interval (e.g., once a day or once a week). Backfills help ensure that all files eventually get processed, especially when file event notification systems don’t guarantee 100% delivery of uploaded files. Available in Databricks Runtime 8.4 (unsupported) and above.

cloudFiles.includeExistingFiles: This boolean option determines whether to include existing files in the stream processing input path or only process new files arriving after initial setup. It’s evaluated only when you start a stream for the first time and has no effect after restarting the stream. By default, it’s set to true.

cloudFiles.inferColumnTypes: When leveraging schema inference, this boolean option controls whether to infer exact column types. By default, columns are inferred as strings when inferring JSON and CSV datasets. Set it to true if you want precise column type inference.

cloudFiles.maxBytesPerTrigger: Specify the maximum number of new bytes to process in each trigger. For example, you can limit each microbatch to 10 GB of data. This is a soft maximum, and Databricks processes up to the lower limit of either cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger, whichever is reached first.
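
Putting those options together, a bronze table in a DLT pipeline could configure the Auto Loader read roughly as in the sketch below. This is a minimal illustration only: the table name, file format, landing path, and the reload-window timestamps are assumptions, not taken from your pipeline.

    import dlt

    @dlt.table(name="bronze_events")  # hypothetical table name
    def bronze_events():
        return (
            spark.readStream.format("cloudFiles")
            # Assumed source format and landing path -- adjust to your setup.
            .option("cloudFiles.format", "json")
            # Evaluated only on the very first start of the stream; it has no
            # effect once a checkpoint already exists.
            .option("cloudFiles.includeExistingFiles", "true")
            # Lets files whose contents changed in place be picked up again.
            .option("cloudFiles.allowOverwrites", "true")
            # Hypothetical reload window for 25 Nov (year is a placeholder).
            .option("modifiedAfter", "<yyyy>-11-25T00:00:00")
            .option("modifiedBefore", "<yyyy>-11-26T00:00:00")
            .load("abfss://landing@<storage-account>.dfs.core.windows.net/daily/")
        )

Because cloudFiles.includeExistingFiles is evaluated only when the stream first starts, changing it (or the modifiedAfter / modifiedBefore filters) on a table that already has a checkpoint will not by itself re-ingest files the stream has already processed.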

Remember that Auto Loader’s directory listing mode allows you to quickly start streams without additional permission configurations beyond access to your data on cloud storage.
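
As a rough illustration of that default behaviour (the path and file format below are placeholders, not from this thread), a stream like this needs nothing beyond read access on the input path:

    # Directory listing mode (the default): Auto Loader discovers new files by
    # listing the input path, so no queue or notification services are needed.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")              # placeholder format
        .option("cloudFiles.useNotifications", "false")  # explicit here; false is already the default
        .load("/mnt/landing/daily/")                     # placeholder path
    )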

Happy data ingestion! 🚀

For more details, you can refer to the official Databricks documentation.
