Delta Live Tables SCD2 performance issue
a week ago
Hi Community,
I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to create an SCD Type 2 using apply_changes, with the row ID as the key and the file date as the sequence.
For the past two weeks, we have observed a significant increase in processing time for this SCD2 step (from 15 minutes to 45 minutes), and I have not been able to optimize it.
Do you have any suggestions for optimizing the SCD2 processing?
More details: I receive a 200 MB Parquet file daily, ingest it, and process it through the SCD2 step to detect historical changes.
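For context, the SCD2 step looks roughly like this (a minimal sketch, not my exact code: the path, table and column names are placeholders, and reading with Auto Loader is an assumption):

```python
import dlt
from pyspark.sql.functions import col

# Daily Parquet files land here (placeholder path)
SOURCE_PATH = "/mnt/landing/daily_parquet/"

@dlt.view(name="source_updates")
def source_updates():
    # Auto Loader incrementally picks up each new daily file
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(SOURCE_PATH)
    )

dlt.create_streaming_table("dim_scd2")

dlt.apply_changes(
    target="dim_scd2",
    source="source_updates",
    keys=["row_id"],               # row ID as the key
    sequence_by=col("file_date"),  # file date as the sequence
    stored_as_scd_type=2,          # keep full history
)
```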
a week ago
Are your Parquet files combined into a single file? If not, try combining them into one file before reading.
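If the daily drop arrives as many small part files, a compaction step like this before the pipeline reads them can help (a sketch; the paths and date variable are placeholders):

```python
file_date = "2025-01-01"  # placeholder for the daily partition
src = f"/mnt/landing/raw_parts/{file_date}/"
dst = f"/mnt/landing/compacted/{file_date}/"

# Rewrite many small part files as a single Parquet file
df = spark.read.parquet(src)
df.coalesce(1).write.mode("overwrite").parquet(dst)
```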
Friday
DLT automatically applies OPTIMIZE and VACUUM to its tables, so I believe that is already covered.
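To double-check, the maintenance behavior can also be pinned explicitly on the target table. This is a sketch using the DLT table properties I believe apply here; Z-ordering on row_id (the merge key) is an assumption worth validating for your data:

```python
import dlt

dlt.create_streaming_table(
    name="dim_scd2",
    table_properties={
        "pipelines.autoOptimize.managed": "true",       # auto OPTIMIZE (DLT default)
        "pipelines.autoOptimize.zOrderCols": "row_id",  # cluster on the merge key
    },
)
```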
Thursday
Hi @scorpusfx1,
What kind of source data do you have? Are these Parquet files daily full snapshots of the source tables? If so, you should use apply_changes_from_snapshot, which is built exactly for this use case: https://docs.databricks.com/aws/en/dlt/python-ref#change-data-capture-from-database-snapshots-with-p...
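The pattern looks roughly like this (a minimal sketch following the docs; the path, the integer versioning scheme, and the column names are placeholders, since you would likely derive the version from your file date):

```python
import dlt

SNAPSHOT_ROOT = "/mnt/landing/daily_parquet"  # placeholder path

def next_snapshot_and_version(latest_snapshot_version):
    # Return (DataFrame, version) for the next unprocessed snapshot,
    # or None when there is nothing new to process yet.
    next_version = (latest_snapshot_version or 0) + 1
    path = f"{SNAPSHOT_ROOT}/v{next_version}"
    try:
        return (spark.read.parquet(path), next_version)
    except Exception:
        return None  # crude missing-path check, fine for a sketch

dlt.create_streaming_table("dim_scd2")

dlt.apply_changes_from_snapshot(
    target="dim_scd2",
    snapshot_and_version=next_snapshot_and_version,
    keys=["row_id"],
    stored_as_scd_type=2,
)
```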
Friday
Thank you for the response. Indeed, this function works well for my case 🙂
However, I am currently using Data Factory to extract the data into files, and I then process them the way apply_changes_from_snapshot does.
Today, my pipeline stopped working and is raising the following exception: terminated with exception: [DELTA_MERGE_MATERIALIZE_SOURCE_FAILED_REPEATEDLY] Keeping the source of the MERGE statement materialized has failed repeatedly. SQLSTATE: XXKST.
Any ideas, please? 😞
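In case it helps someone else, what I am trying next is raising the retry count for MERGE source materialization via the pipeline settings. The conf name below comes from the Delta error documentation; treat it as an assumption and verify against the docs for your runtime:

```python
# Fragment of the DLT pipeline settings ("configuration" map).
pipeline_configuration = {
    # retry source materialization more times before failing the MERGE
    "spark.databricks.delta.merge.materializeSource.maxAttempts": "8",
}
```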

