Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Monday
I have a lot of DLT tables creating using the `apply_changes` function with type 2 history. This functions creates a physical table `__apply_changes_storage_<table_name>` and a view on top of this `<table_name>`. The number of rows the physical table is about 100x as large as the view, and it seems to be because there are a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-spark environment, this gives huge performance slowdown. Is there any way to avoid these rows (I guess they exist to be able to handle late-arriving data or deletes or something of this sort, but maybe there is a way to configure this)?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tuesday
To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:
-
Filter Out Hidden Rows: When querying the physical table, you can filter out the rows where
__rowIsHidden=True
. This can be done by adding a condition to your query to exclude these rows. For example:SELECT * FROM __apply_changes_storage_<table_name> WHERE __rowIsHidden = False;
-
Use the View: If possible, use the view
<table_name>
instead of the physical table for your queries. The view is designed to filter out the hidden rows and provide a cleaner dataset. -
Optimize the Table: Consider optimizing the physical table by periodically running maintenance operations such as
VACUUM
to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance. -
Configure Retention: Adjust the retention settings for the CDC tombstones if your use case allows for a shorter retention period. This can be configured with the
pipelines.cdc.tombstoneGCThresholdInSeconds
table property.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Tuesday
Suggestion 4 seems to be the only one, which might actually reduce the number of rows in the physical table. However, i can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Wednesday
Any ideas?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Wednesday
Can you confirm you are setting this config properly:
ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0');