Migrating source directory in an existing DLT Pipeline with Autoloader
Tuesday
I have a DLT pipeline that reads data from S3 into an append-only bronze layer using Autoloader. The source bucket needs to be changed to a new S3 bucket in a new account, and the data in the existing S3 bucket migrated to the new one.
Will Autoloader still be able to tell that it has already processed those files after they have been replicated to the new S3 bucket?
If not, I'm thinking of setting Autoloader's `cloudFiles.includeExistingFiles = "false"` and the table property `{"pipelines.reset.allowed": "false"}` to avoid reprocessing all that old Bronze layer data. Sample Python code for the current pipeline below, for reference:
import dlt
from pyspark.sql import DataFrame

@dlt.table(
    name="append_only_bronze_layer",
    table_properties={"quality": "bronze"},
)
def raw_bronze_layer() -> DataFrame:
    # spark is provided by the DLT runtime
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/this/path/will/change")
    )
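For comparison, here's a rough sketch of what I'm considering for the migrated version. The new bucket path is just a placeholder, and I haven't tested these settings yet:

import dlt
from pyspark.sql import DataFrame

@dlt.table(
    name="append_only_bronze_layer",
    table_properties={
        "quality": "bronze",
        # keep full refreshes from resetting (and re-ingesting) this table
        "pipelines.reset.allowed": "false",
    },
)
def raw_bronze_layer() -> DataFrame:
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "avro")
        .option("cloudFiles.inferColumnTypes", "true")
        # skip files already sitting in the new bucket when the stream first starts
        .option("cloudFiles.includeExistingFiles", "false")
        .load("s3://new-bucket/landing/")  # placeholder for the new location
    )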
Thanks in advance for any advice on how to avoid re-processing data when the DLT source path changes.
Tuesday
You might be saved by the documentation here: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/directory-listing-mode...
Tuesday
Hi SamAdams,

How are you doing today? As I understand it, you're on the right track here. When you change the S3 path for Autoloader, even if the files are exactly the same and just copied from the old bucket, Autoloader will treat them as new files because it tracks them by their original path and metadata. So yes, it could reprocess everything unless you take steps to avoid it.

Setting cloudFiles.includeExistingFiles=false is a smart move: it tells Autoloader to ignore any files already present in the new path and only pick up new ones going forward. Adding {"pipelines.reset.allowed": "false"} also helps make sure the pipeline doesn't accidentally reset and reprocess old data.

As long as you keep your checkpoint location the same and the old files don't get picked up again, you should be safe. Let me know if you need help double-checking your setup before switching buckets!
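One way to double-check before and after the cutover is to inspect which files Auto Loader has already recorded for the table with the cloud_files_state function. A rough sketch below; the three-part table name is just an example based on your post, and the TABLE() form needs a recent runtime, so treat it as a starting point rather than a verified recipe:

# Run from a regular notebook (not inside the DLT pipeline itself) to see
# which files Auto Loader has already discovered and committed for the table.
tracked_files = spark.sql(
    "SELECT * FROM cloud_files_state("
    "TABLE(my_catalog.my_schema.append_only_bronze_layer))"
)
tracked_files.show(truncate=False)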
Regards,
Brahma

