Databricks Community

UlrikChristense · ‎01-13-2025

I have a lot of DLT tables created using the `apply_changes` function with type 2 history. This function creates a physical table `__apply_changes_storage_<table_name>` and a view on top of it, `<table_name>`. The physical table has about 100x as many rows as the view, apparently because it contains a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-Spark environment, this causes a huge performance slowdown. Is there any way to avoid these rows? I guess they exist to handle late-arriving data, deletes, or something of that sort, but maybe there is a way to configure this.

Walter_C · ‎01-14-2025

To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:

1. Filter Out Hidden Rows: When querying the physical table, exclude the rows where `__rowIsHidden=True` by adding a condition to your query. For example:
```
SELECT * FROM __apply_changes_storage_<table_name> WHERE __rowIsHidden = False;
```
2. Use the View: If possible, query the view `<table_name>` instead of the physical table. The view is designed to filter out the hidden rows and provide a cleaner dataset.
3. Optimize the Table: Consider periodically running maintenance operations such as VACUUM on the physical table to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance.
4. Configure Retention: If your use case allows for a shorter retention period, adjust the retention settings for the CDC tombstones. This can be configured with the `pipelines.cdc.tombstoneGCThresholdInSeconds` table property.
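
For the non-Spark case, the filter from the first option can be built once and handed to whatever SQL client is in use. This is only a minimal sketch: the helper name `visible_rows_query` and the example table name `customers` are hypothetical, and it assumes the `__apply_changes_storage_` prefix and the `__rowIsHidden` column behave as described in the question.

```python
import re

# Prefix of the physical table that apply_changes creates, as described in the question.
STORAGE_PREFIX = "__apply_changes_storage_"


def visible_rows_query(table_name: str) -> str:
    """Build a query over the physical table that skips hidden rows.

    Hypothetical helper for illustration; not part of any Databricks API.
    """
    # Accept plain identifiers only, so the name can be interpolated safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return (
        f"SELECT * FROM {STORAGE_PREFIX}{table_name} "
        "WHERE __rowIsHidden = False"
    )


# "customers" is a made-up table name used only for this example.
print(visible_rows_query("customers"))
```

This only hides the rows at query time; the physical table still stores them, so it does not by itself reduce the row count.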

UlrikChristense · ‎01-14-2025

Suggestion 4 seems to be the only one that might actually reduce the number of rows in the physical table. However, I can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.

UlrikChristense · ‎01-15-2025

Any ideas?

Walter_C · ‎01-15-2025

Can you confirm you are setting this config properly:

ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0');
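
To keep these settings in one place, the same two properties can be held in a dictionary and rendered into the statement above. This is a rough sketch with hypothetical helper names (`tombstone_gc_properties`, `alter_table_statement`). Whether the same dictionary can instead be passed as `table_properties` when the target table is declared in the pipeline code is an assumption that this thread does not confirm.

```python
from typing import Dict


def tombstone_gc_properties(threshold_seconds: int, frequency_seconds: int) -> Dict[str, str]:
    """Return the two tombstone GC properties named in this thread as strings."""
    if threshold_seconds < 0 or frequency_seconds < 0:
        raise ValueError("durations must be non-negative")
    return {
        "pipelines.cdc.tombstoneGCThresholdInSeconds": str(threshold_seconds),
        "pipelines.cdc.tombstoneGCFrequencyInSeconds": str(frequency_seconds),
    }


def alter_table_statement(table_name: str, properties: Dict[str, str]) -> str:
    """Render an ALTER TABLE ... SET TBLPROPERTIES statement (hypothetical helper)."""
    # Each property becomes a quoted 'key' = 'value' pair, in insertion order.
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in properties.items())
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ({pairs});"


# Values taken from the statement above; your_table_name is a placeholder.
props = tombstone_gc_properties(threshold_seconds=60, frequency_seconds=0)
print(alter_table_statement("your_table_name", props))
```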

UlrikChristense · ‎01-22-2025

I'm trying, but it doesn't seem to change anything. When are these table properties actually applied: when the job runs, or as a background process?

Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows
