
Delta Live Table SCD2 performance issue

scorpusfx1
New Contributor II

Hi Community,

I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate it into my Lakehouse. This data is used to build an SCD Type 2 table using apply_changes, with the row ID as the key and the file date as the sequence.

Over the past two weeks, we have observed a significant increase in processing time for this SCD2 step (from 15 minutes to 45 minutes), and I have been unable to optimize it.

Do you have any suggestions for optimizing the SCD2 processing?

More details: I receive a 200 MB Parquet file daily, ingest it, and process it through the SCD2 step to detect historical changes.
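
For reference, the SCD2 step looks roughly like this (table, column, and path names below are simplified placeholders, not my exact code):

```python
import dlt
from pyspark.sql import functions as F

# Bronze view over the daily Parquet drop (path is a placeholder)
@dlt.view
def daily_parquet_source():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("/Volumes/raw/daily_parquet/")
    )

# Target streaming table that holds the SCD Type 2 history
dlt.create_streaming_table("dim_scd2")

# SCD2 via apply_changes: row ID as the key, file date as the sequence
dlt.apply_changes(
    target="dim_scd2",
    source="daily_parquet_source",
    keys=["row_id"],                  # placeholder key column
    sequence_by=F.col("file_date"),   # placeholder sequence column
    stored_as_scd_type=2,
)
```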

4 REPLIES

BricksGuy
New Contributor III

Are your Parquet files combined into a single file? If not, try combining them into a single file before reading.
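
Something like this could compact one day's small files into a single Parquet file before the pipeline reads them (paths are just examples):

```python
# Compact a day's small Parquet files into a single file (illustrative paths)
(
    spark.read.parquet("/landing/2024-06-01/")
    .coalesce(1)                      # write a single output file
    .write.mode("overwrite")
    .parquet("/landing_compacted/2024-06-01/")
)
```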

DLT automatically applies OPTIMIZE and VACUUM to the data, so I believe that is already the case.

Stefan-Koch
Valued Contributor II

hi @scorpusfx1 

What kind of source data do you have? Are these parquet files daily full snapshots of source tables? If so, you should use apply_changes_from_snapshot, which is exactly built for this use case. https://docs.databricks.com/aws/en/dlt/python-ref#change-data-capture-from-database-snapshots-with-p... 
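
A rough sketch of what that could look like for daily snapshot files, following the pattern in the linked docs; the path layout, the date-based versioning, and the key column are placeholders you would adapt:

```python
import dlt
from datetime import date, timedelta

# Return the next daily snapshot as (DataFrame, version), or None when caught up.
def next_snapshot_and_version(latest_snapshot_version):
    first_snapshot_date = date(2024, 1, 1)                  # placeholder start date
    version = (latest_snapshot_version or 0) + 1
    snapshot_date = first_snapshot_date + timedelta(days=version - 1)
    path = f"/Volumes/raw/daily_parquet/{snapshot_date.isoformat()}/"  # placeholder
    try:
        df = spark.read.parquet(path)
    except Exception:
        return None                                         # no new snapshot yet
    return (df, version)

dlt.create_streaming_table("dim_scd2_from_snapshot")

dlt.apply_changes_from_snapshot(
    target="dim_scd2_from_snapshot",
    snapshot_and_version=next_snapshot_and_version,
    keys=["row_id"],                                        # placeholder key column
    stored_as_scd_type=2,
)
```

The snapshot function is called repeatedly during an update until it returns None, so each run only processes snapshots that haven't been applied yet.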

Thank you for the response. Indeed, this function works well for my case 🙂

However, I am currently using Data Factory to extract the data into files and then process them the way the apply_changes_from_snapshot function does.

Today, my pipeline is no longer working, and it's generating the following exception: terminated with exception: [DELTA_MERGE_MATERIALIZE_SOURCE_FAILED_REPEATEDLY] Keeping the source of the MERGE statement materialized has failed repeatedly. SQLSTATE: XXKST.

Any ideas, please? 😞
