Hi Community,
I am working on ingestion pipelines that take data from Parquet files (200 MB per day) and integrate them into my Lakehouse. This data is used to build an SCD Type 2 table using apply_changes, with the row ID as the key and the file date as the sequence column.
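For reference, the SCD2 step looks roughly like the apply_changes call below. This is a minimal sketch, not my exact code: the table names, the row_id key, and the file_date sequence column are placeholders matching the description above.

```python
import dlt
from pyspark.sql import functions as F

# Target streaming table that holds the SCD Type 2 history
dlt.create_streaming_table("dim_history")

# Merge the daily feed into the SCD2 table:
# - keys: the row ID that identifies a record across daily files
# - sequence_by: the file date, used to order changes per key
# - stored_as_scd_type=2: keep full history (DLT adds __START_AT/__END_AT)
dlt.apply_changes(
    target="dim_history",
    source="bronze_daily",          # placeholder name for the ingested daily data
    keys=["row_id"],
    sequence_by=F.col("file_date"),
    stored_as_scd_type=2,
)
```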
For the past two weeks, we have observed a significant increase in processing time for this SCD2 step (from 15 minutes to 45 minutes), and I have not been able to optimize it.
Do you have any suggestions for optimizing the SCD2 processing?
More details: I receive a 200 MB Parquet file daily, ingest it, and process it through the SCD2 step to detect historical changes.
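In case it helps, the ingestion side is roughly the following. Again, this is only a sketch: I am assuming Auto Loader here, and the source path and table name are placeholders.

```python
import dlt

@dlt.table(name="bronze_daily", comment="Raw daily Parquet drop, ~200 MB per file")
def bronze_daily():
    # Incrementally pick up the new daily Parquet file with Auto Loader
    # (spark is available implicitly inside a DLT pipeline)
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("/mnt/raw/daily/")  # placeholder path for the daily drop
    )
```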