Delta Live Tables SCD2 performance issue
a week ago
Hi Community,
I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to create an SCD Type 2 using apply_changes, with the row ID as the key and the file date as the sequence.
For the past two weeks, we have observed a significant increase in processing time for this SCD2 step (from 15 minutes to 45 minutes), and I have not been able to optimize it.
Do you have any suggestions for optimizing the SCD2 processing?
More details: I receive a 200 MB Parquet file daily, ingest it, and process it through the SCD2 step to detect historical changes.
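For context, the SCD2 step looks roughly like this (a minimal sketch, not my exact code: the path, table and column names are placeholders, and reading with Auto Loader is an assumption):

```python
import dlt
from pyspark.sql.functions import col

# Daily Parquet files land here (placeholder path)
SOURCE_PATH = "/mnt/landing/daily_parquet/"

@dlt.view(name="source_updates")
def source_updates():
    # Auto Loader incrementally picks up each new daily file
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(SOURCE_PATH)
    )

dlt.create_streaming_table("dim_scd2")

dlt.apply_changes(
    target="dim_scd2",
    source="source_updates",
    keys=["row_id"],               # row ID as the key
    sequence_by=col("file_date"),  # file date as the sequence
    stored_as_scd_type=2,          # keep full history
)
```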
a week ago
Are your Parquet files combined into a single file? If not, try combining them into one file before reading.
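If the daily drop arrives as many small part files, a compaction step like this before the pipeline reads them can help (a sketch; the paths and date variable are placeholders):

```python
file_date = "2025-01-01"  # placeholder for the daily partition
src = f"/mnt/landing/raw_parts/{file_date}/"
dst = f"/mnt/landing/compacted/{file_date}/"

# Rewrite many small part files as a single Parquet file
df = spark.read.parquet(src)
df.coalesce(1).write.mode("overwrite").parquet(dst)
```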
Friday
DLT automatically applies OPTIMIZE and VACUUM to its tables, so I believe that is already covered.
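To double-check, the maintenance behavior can also be pinned explicitly on the target table. This is a sketch using the DLT table properties I believe apply here; Z-ordering on row_id (the merge key) is an assumption worth validating for your data:

```python
import dlt

dlt.create_streaming_table(
    name="dim_scd2",
    table_properties={
        "pipelines.autoOptimize.managed": "true",       # auto OPTIMIZE (DLT default)
        "pipelines.autoOptimize.zOrderCols": "row_id",  # cluster on the merge key
    },
)
```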
Thursday
Hi @scorpusfx1,
What kind of source data do you have? Are these Parquet files daily full snapshots of the source tables? If so, you should use apply_changes_from_snapshot, which is built exactly for this use case: https://docs.databricks.com/aws/en/dlt/python-ref#change-data-capture-from-database-snapshots-with-p...
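The pattern looks roughly like this (a minimal sketch following the docs; the path, the integer versioning scheme, and the column names are placeholders, since you would likely derive the version from your file date):

```python
import dlt

SNAPSHOT_ROOT = "/mnt/landing/daily_parquet"  # placeholder path

def next_snapshot_and_version(latest_snapshot_version):
    # Return (DataFrame, version) for the next unprocessed snapshot,
    # or None when there is nothing new to process yet.
    next_version = (latest_snapshot_version or 0) + 1
    path = f"{SNAPSHOT_ROOT}/v{next_version}"
    try:
        return (spark.read.parquet(path), next_version)
    except Exception:
        return None  # crude missing-path check, fine for a sketch

dlt.create_streaming_table("dim_scd2")

dlt.apply_changes_from_snapshot(
    target="dim_scd2",
    snapshot_and_version=next_snapshot_and_version,
    keys=["row_id"],
    stored_as_scd_type=2,
)
```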
Friday
Thank you for the response. Indeed, this function works well for my case 🙂
However, I am currently using Data Factory to extract the data into files, and I then process them the way apply_changes_from_snapshot does.
Today, my pipeline stopped working and is raising the following exception: terminated with exception: [DELTA_MERGE_MATERIALIZE_SOURCE_FAILED_REPEATEDLY] Keeping the source of the MERGE statement materialized has failed repeatedly. SQLSTATE: XXKST.
Any ideas, please? 😞
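In case it helps someone else, what I am trying next is raising the retry count for MERGE source materialization via the pipeline settings. The conf name below comes from the Delta error documentation; treat it as an assumption and verify against the docs for your runtime:

```python
# Fragment of the DLT pipeline settings ("configuration" map).
pipeline_configuration = {
    # retry source materialization more times before failing the MERGE
    "spark.databricks.delta.merge.materializeSource.maxAttempts": "8",
}
```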

