<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105688#M42240</link>
    <description>&lt;P&gt;Any ideas?&lt;/P&gt;</description>
    <pubDate>Wed, 15 Jan 2025 10:17:03 GMT</pubDate>
    <dc:creator>UlrikChristense</dc:creator>
    <dc:date>2025-01-15T10:17:03Z</dc:date>
    <item>
      <title>Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105530#M42169</link>
      <description>&lt;P&gt;I have a lot of DLT tables creating using the `apply_changes` function with type 2 history. This functions creates a physical table `__apply_changes_storage_&amp;lt;table_name&amp;gt;` and a view on top of this `&amp;lt;table_name&amp;gt;`. The number of rows the physical table is about 100x as large as the view, and it seems to be because there are a lot of rows with `__rowIsHidden=True`. Since I also want to be able to query the physical table from a non-spark environment, this gives huge performance slowdown. Is there any way to avoid these rows (I guess they exist to be able to handle late-arriving data or deletes or something of this sort, but maybe there is a way to configure this)?&lt;/P&gt;</description>
      <pubDate>Tue, 14 Jan 2025 07:27:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105530#M42169</guid>
      <dc:creator>UlrikChristense</dc:creator>
      <dc:date>2025-01-14T07:27:16Z</dc:date>
    </item>
    <item>
      <title>Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105550#M42180</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;To address the performance slowdown when querying the physical table from a non-Spark environment, you can consider the following options:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Filter Out Hidden Rows&lt;/STRONG&gt;: When querying the physical table, you can filter out the rows where &lt;CODE&gt;__rowIsHidden=True&lt;/CODE&gt;. This can be done by adding a condition to your query to exclude these rows. For example:&lt;/P&gt;
&lt;DIV class="gb5fhw2"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql _1t7bu9hb hljs language-sql gb5fhw3"&gt;&lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; __apply_changes_storage_&lt;SPAN class="hljs-operator"&gt;&amp;lt;&lt;/SPAN&gt;table_name&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;WHERE&lt;/SPAN&gt; __rowIsHidden &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="hljs-literal"&gt;False&lt;/SPAN&gt;;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Use the View&lt;/STRONG&gt;: If possible, use the view &lt;CODE&gt;&amp;lt;table_name&amp;gt;&lt;/CODE&gt; instead of the physical table for your queries. The view is designed to filter out the hidden rows and provide a cleaner dataset.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Optimize the Table&lt;/STRONG&gt;: Consider optimizing the physical table by periodically running maintenance operations such as &lt;CODE&gt;VACUUM&lt;/CODE&gt; to remove old versions of the data that are no longer needed. This can help reduce the size of the table and improve query performance.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Configure Retention&lt;/STRONG&gt;: Adjust the retention settings for the CDC tombstones if your use case allows for a shorter retention period. This can be configured with the &lt;CODE&gt;pipelines.cdc.tombstoneGCThresholdInSeconds&lt;/CODE&gt; table property.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Tue, 14 Jan 2025 10:13:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105550#M42180</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-14T10:13:26Z</dc:date>
    </item>
    <item>
      <title>Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105561#M42183</link>
      <description>&lt;P&gt;Suggestion 4 seems to be the only one, which might actually reduce the number of rows in the physical table. However, i can't get it to work. I have set this property to 0, yet the number of hidden rows remains the same.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Jan 2025 11:37:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105561#M42183</guid>
      <dc:creator>UlrikChristense</dc:creator>
      <dc:date>2025-01-14T11:37:37Z</dc:date>
    </item>
    <item>
      <title>Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105688#M42240</link>
      <description>&lt;P&gt;Any ideas?&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jan 2025 10:17:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105688#M42240</guid>
      <dc:creator>UlrikChristense</dc:creator>
      <dc:date>2025-01-15T10:17:03Z</dc:date>
    </item>
    <item>
      <title>Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105717#M42252</link>
      <description>&lt;P&gt;Can you confirm you are setting this config properly:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;ALTER TABLE your_table_name SET TBLPROPERTIES ('pipelines.cdc.tombstoneGCThresholdInSeconds' = '60', 'pipelines.cdc.tombstoneGCFrequencyInSeconds' = '0'); &lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 15 Jan 2025 14:34:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/105717#M42252</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-15T14:34:39Z</dc:date>
    </item>
    <item>
      <title>Re: Apply-changes-table (SCD2) with huge amounts of `rowIsHidden=True` rows</title>
      <link>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/106654#M42539</link>
      <description>&lt;P&gt;I'm trying, but doesn't seem to change anything. Setting these table properties - when are the "applied"? When the job is run, or as a background thing?&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jan 2025 13:22:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/apply-changes-table-scd2-with-huge-amounts-of-rowishidden-true/m-p/106654#M42539</guid>
      <dc:creator>UlrikChristense</dc:creator>
      <dc:date>2025-01-22T13:22:18Z</dc:date>
    </item>
  </channel>
</rss>

