
Migrating source directory in an existing DLT Pipeline with Autoloader

SamAdams
New Contributor III

I have a DLT pipeline that reads data in S3 into an append-only bronze layer using Autoloader. The data source needs to be changed to a new S3 bucket in a new account, and the data in the existing S3 bucket migrated to the new one.

Will Autoloader still be able to tell that it has already processed those files after they have been replicated to the new S3 bucket?

I'm thinking of setting Autoloader's `cloudFiles.includeExistingFiles=False` and the table property `{"pipelines.reset.allowed": "false"}` to avoid re-processing all that old Bronze layer data if it can't. Sample Python code below, for reference:

import dlt
from pyspark.sql import DataFrame

@dlt.table(
    name="append_only_bronze_layer",
    table_properties={"quality": "bronze"},
)
def raw_bronze_layer() -> DataFrame:
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/this/path/will/change")
    )
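
For context, once the source path changes I'm planning for the updated table definition to look roughly like this, with the two settings mentioned above applied (the new bucket path is just a placeholder):

import dlt
from pyspark.sql import DataFrame

@dlt.table(
    name="append_only_bronze_layer",
    table_properties={
        "quality": "bronze",
        # keep full refreshes from wiping and reprocessing the old Bronze data
        "pipelines.reset.allowed": "false",
    },
)
def raw_bronze_layer() -> DataFrame:
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .option("cloudFiles.inferColumnTypes", "true")
        # skip the files replicated into the new bucket; only pick up new arrivals
        .option("cloudFiles.includeExistingFiles", "false")
        .load("s3://new-bucket-placeholder/path/")  # placeholder for the new bucket path
    )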

Thanks in advance for any advice on how to avoid re-processing data when the DLT source path changes.

2 REPLIES


Brahmareddy
Honored Contributor II

Hi SamAdams,

How are you doing today? As I understand it, you're on the right track here! When you change the S3 path for Autoloader, even if the files are exactly the same and were simply copied from the old bucket, Autoloader will treat them as new files because it tracks them by their original path and metadata. So yes, it could reprocess everything unless you take steps to avoid it.

Setting cloudFiles.includeExistingFiles=False is a smart move: it tells Autoloader to ignore any files already present in the new path and only pick up new ones going forward. Adding {"pipelines.reset.allowed": "false"} also helps make sure the pipeline doesn't accidentally reset and reprocess old data. As long as you keep your checkpoint location the same and the old files don't get picked up again, you should be safe. Let me know if you need help double-checking your setup before switching buckets!
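
If you want to sanity-check what Autoloader has already tracked before you switch, you can query the Auto Loader checkpoint state with the cloud_files_state function, something like the sketch below. The checkpoint path shown is only an example; for a DLT pipeline the checkpoint sits under the pipeline's storage location, so adjust it to your setup.

# Inspect which files Auto Loader has already recorded in its checkpoint.
# The path below is an example placeholder, not your actual checkpoint location.
tracked_files = spark.sql(
    "SELECT * FROM cloud_files_state("
    "'s3://pipeline-storage-placeholder/checkpoints/append_only_bronze_layer')"
)

# Entries are keyed by each file's full path, which is why the same files
# copied into a new bucket would be treated as brand-new files.
tracked_files.select("path").show(truncate=False)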

Regards,

Brahma
