Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-13-2025 11:27 PM
I have a lot of DLT tables creating using the `apply_changes` function with type 2 history. This functions creates a physical table `__apply_changes_storage_<table_name>` and a view on top of this `<table_name>`. The number of rows the physical table is about 100x as large as the view, and it seems to be because there are a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-spark environment, this gives huge performance slowdown. Is there any way to avoid these rows (I guess they exist to be able to handle late-arriving data or deletes or something of this sort, but maybe there is a way to configure this)?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-14-2025 02:13 AM
To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:
-
Filter Out Hidden Rows: When querying the physical table, you can filter out the rows where
__rowIsHidden=True
. This can be done by adding a condition to your query to exclude these rows. For example:SELECT * FROM __apply_changes_storage_<table_name> WHERE __rowIsHidden = False;
-
Use the View: If possible, use the view
<table_name>
instead of the physical table for your queries. The view is designed to filter out the hidden rows and provide a cleaner dataset. -
Optimize the Table: Consider optimizing the physical table by periodically running maintenance operations such as
VACUUM
to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance. -
Configure Retention: Adjust the retention settings for the CDC tombstones if your use case allows for a shorter retention period. This can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds
table property.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-14-2025 03:37 AM
Suggestion 4 seems to be the only one, which might actually reduce the number of rows in the physical table. However, i can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-15-2025 02:17 AM
Any ideas?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-15-2025 06:34 AM
Can you confirm you are setting this config properly:
ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0');
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-22-2025 05:22 AM
I'm trying, but doesn't seem to change anything. Setting these table properties - when are the "applied"? When the job is run, or as a background thing?

