topic Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows in Data Engineering

Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

UlrikChristense — Tue, 14 Jan 2025 07:27:16 GMT

I have a lot of DLT tables creating using the `apply_changes` function with type 2 history. This functions creates a physical table `__apply_changes_storage_<table_name>` and a view on top of this `<table_name>`. The number of rows the physical table is about 100x as large as the view, and it seems to be because there are a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-spark environment, this gives huge performance slowdown. Is there any way to avoid these rows (I guess they exist to be able to handle late-arriving data or deletes or something of this sort, but maybe there is a way to configure this)?

Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

Walter_C — Tue, 14 Jan 2025 10:13:26 GMT

To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:

Filter Out Hidden Rows: When querying the physical table, you can filter out the rows where __rowIsHidden=True. This can be done by adding a condition to your query to exclude these rows. For example:
```
SELECT * FROM __apply_changes_storage_<table_name> WHERE __rowIsHidden = False;
```
Use the View: If possible, use the view <table_name> instead of the physical table for your queries. The view is designed to filter out the hidden rows and provide a cleaner dataset.
Optimize the Table: Consider optimizing the physical table by periodically running maintenance operations such as VACUUM to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance.
Configure Retention: Adjust the retention settings for the CDC tombstones if your use case allows for a shorter retention period. This can be configured with the pipelines.cdc.tombstoneGCThresholdInSeconds table property.

Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

UlrikChristense — Tue, 14 Jan 2025 11:37:37 GMT

Suggestion 4 seems to be the only one, which might actually reduce the number of rows in the physical table. However, i can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.

Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

UlrikChristense — Wed, 15 Jan 2025 10:17:03 GMT

Any ideas?

Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

Walter_C — Wed, 15 Jan 2025 14:34:39 GMT

Can you confirm you are setting this config properly:

ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0');

Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

UlrikChristense — Wed, 22 Jan 2025 13:22:18 GMT

I'm trying, but doesn't seem to change anything. Setting these table properties - when are the "applied"? When the job is run, or as a background thing?