cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows

UlrikChristense
New Contributor II

I have a lot of DLT tables creating using the `apply_changes` function with type 2 history. This functions creates a physical table `__apply_changes_storage_<table_name>` and a view on top of this `<table_name>`. The number of rows the physical table is about 100x as large as the view, and it seems to be because there are a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-spark environment, this gives huge performance slowdown. Is there any way to avoid these rows (I guess they exist to be able to handle late-arriving data or deletes or something of this sort, but maybe there is a way to configure this)?

4 REPLIES 4

Walter_C
Databricks Employee
Databricks Employee

To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:

  1. Filter Out Hidden Rows: When querying the physical table, you can filter out the rows where __rowIsHidden=True. This can be done by adding a condition to your query to exclude these rows. For example:

    SELECT * FROM __apply_changes_storage_<table_name> WHERE __rowIsHidden = False;
  2. Use the View: If possible, use the view <table_name> instead of the physical table for your queries. The view is designed to filter out the hidden rows and provide a cleaner dataset.

  3. Optimize the Table: Consider optimizing the physical table by periodically running maintenance operations such as VACUUM to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance.

  4. Configure Retention: Adjust the retention settings for the CDC tombstones if your use case allows for a shorter retention period. This can be configured with the pipelines.cdc.tombstoneGCThresholdInSeconds table property.

Suggestion 4 seems to be the only one, which might actually reduce the number of rows in the physical table. However, i can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.

UlrikChristense
New Contributor II

Any ideas?

Walter_C
Databricks Employee
Databricks Employee

Can you confirm you are setting this config properly:

ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0'); 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group