<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Live Table SCD2 performance issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112544#M44247</link>
    <description>&lt;P&gt;The DLT automatically applies OPTIMIZE and VACUUM to the data, so I believe that's the case.&lt;/P&gt;</description>
    <pubDate>Fri, 14 Mar 2025 07:42:46 GMT</pubDate>
    <dc:creator>scorpusfx1</dc:creator>
    <dc:date>2025-03-14T07:42:46Z</dc:date>
    <item>
      <title>Delta Live Table SCD2 performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112451#M44214</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hi Community,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to create an SCD Type 2 using apply_changes, with the row ID as the key and the file date as the sequence.&lt;/P&gt;&lt;P&gt;For the past two weeks, we have observed a significant increase in processing time for this SCD2 step (from 15 minutes to 45 minutes), and I have been unable to optimize it.&lt;/P&gt;&lt;P&gt;Do you have any suggestions for optimizing the SCD2 processing?&lt;/P&gt;&lt;P&gt;More details: I receive a 200 MB Parquet file daily, ingest it, and process it through the SCD2 step to detect historical changes.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Mar 2025 09:01:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112451#M44214</guid>
      <dc:creator>scorpusfx1</dc:creator>
      <dc:date>2025-03-13T09:01:45Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table SCD2 performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112484#M44226</link>
      <description>&lt;P&gt;Are your parquet files clubbed together as a single file? If not, then try to club them into a single file and then read.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Mar 2025 15:04:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112484#M44226</guid>
      <dc:creator>BricksGuy</dc:creator>
      <dc:date>2025-03-13T15:04:26Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table SCD2 performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112538#M44246</link>
      <description>&lt;P&gt;hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/153237"&gt;@scorpusfx1&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;What kind of source data do you have? Are these parquet files daily full snapshots of source tables? If so, you should use&amp;nbsp;&lt;STRONG&gt;apply_changes_from_snapshot,&amp;nbsp;&lt;/STRONG&gt;which is exactly built for this use case.&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/dlt/python-ref#change-data-capture-from-database-snapshots-with-python-in-dlt" target="_blank"&gt;https://docs.databricks.com/aws/en/dlt/python-ref#change-data-capture-from-database-snapshots-with-python-in-dlt&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 05:40:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112538#M44246</guid>
      <dc:creator>Stefan-Koch</dc:creator>
      <dc:date>2025-03-14T05:40:09Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table SCD2 performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112544#M44247</link>
      <description>&lt;P&gt;The DLT automatically applies OPTIMIZE and VACUUM to the data, so I believe that's the case.&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 07:42:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112544#M44247</guid>
      <dc:creator>scorpusfx1</dc:creator>
      <dc:date>2025-03-14T07:42:46Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Table SCD2 performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112545#M44248</link>
      <description>&lt;P&gt;Thank you for the response. Indeed, this function works well for my case &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;However, I am currently using a Data Factory to extract the data into files and process them like the apply_change_snapshot function does.&lt;/P&gt;&lt;P&gt;Today, my pipeline is no longer working, and it's generating the following exception:&amp;nbsp;&lt;SPAN&gt;terminated with exception: [DELTA_MERGE_MATERIALIZE_SOURCE_FAILED_REPEATEDLY] Keeping the source of the MERGE statement materialized has failed repeatedly. SQLSTATE: XXKST.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any idea please &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 07:52:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-table-scd2-performance-issue/m-p/112545#M44248</guid>
      <dc:creator>scorpusfx1</dc:creator>
      <dc:date>2025-03-14T07:52:14Z</dc:date>
    </item>
  </channel>
</rss>

