<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data Skipping- Partitioned tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-skipping-partitioned-tables/m-p/122149#M46675</link>
    <description>&lt;P&gt;Hi, delta.dataSkippingStatsColumns specifies a comma-separated list of column names for which Delta Lake collects statistics. It can improve performance because it supersedes the default behavior of collecting statistics on the first 32 columns of the table, so statistics are gathered only for the columns you list.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Set the property when creating or modifying a table:
&lt;OL&gt;
&lt;LI&gt;ALTER TABLE your_table_name &lt;BR /&gt;SET TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'col1, col2, col3');&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;In Databricks Runtime (DBR) 14.3 LTS and above, you can manually recompute statistics for existing data after changing the property (the property itself only affects statistics collected on future writes):
&lt;OL&gt;
&lt;LI&gt;ANALYZE TABLE your_table_name COMPUTE DELTA STATISTICS;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;Avoid long string columns since they are expensive to analyze. These should be excluded from delta.dataSkippingStatsColumns.&lt;/LI&gt;
&lt;LI&gt;Ideally, the columns specified for statistics should overlap with your filtering criteria, e.g. partitioning columns or high-cardinality columns. Statistics on unused or rarely filtered columns can waste compute resources without significant benefit.&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Wed, 18 Jun 2025 16:16:07 GMT</pubDate>
    <dc:creator>paolajara</dc:creator>
    <dc:date>2025-06-18T16:16:07Z</dc:date>
    <item>
      <title>Data Skipping- Partitioned tables</title>
      <link>https://community.databricks.com/t5/data-engineering/data-skipping-partitioned-tables/m-p/121725#M46529</link>
      <description>&lt;P&gt;Hi all,&lt;BR /&gt;&lt;BR /&gt;I have a question: how can we modify delta.dataSkippingStatsColumns and compute statistics for a partitioned Delta table in Databricks? I want to understand the process and best practices for changing this setting and ensuring accurate statistics for partitioned data. Any guidance would be appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jun 2025 15:28:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-skipping-partitioned-tables/m-p/121725#M46529</guid>
      <dc:creator>Sainath368</dc:creator>
      <dc:date>2025-06-13T15:28:57Z</dc:date>
    </item>
    <item>
      <title>Re: Data Skipping- Partitioned tables</title>
      <link>https://community.databricks.com/t5/data-engineering/data-skipping-partitioned-tables/m-p/122149#M46675</link>
      <description>&lt;P&gt;Hi, delta.dataSkippingStatsColumns specifies a comma-separated list of column names for which Delta Lake collects statistics. It can improve performance because it supersedes the default behavior of collecting statistics on the first 32 columns of the table, so statistics are gathered only for the columns you list.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Set the property when creating or modifying a table:
&lt;OL&gt;
&lt;LI&gt;ALTER TABLE your_table_name &lt;BR /&gt;SET TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'col1, col2, col3');&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;In Databricks Runtime (DBR) 14.3 LTS and above, you can manually recompute statistics for existing data after changing the property (the property itself only affects statistics collected on future writes):
&lt;OL&gt;
&lt;LI&gt;ANALYZE TABLE your_table_name COMPUTE DELTA STATISTICS;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/LI&gt;
&lt;LI&gt;Avoid long string columns since they are expensive to analyze. These should be excluded from delta.dataSkippingStatsColumns.&lt;/LI&gt;
&lt;LI&gt;Ideally, the columns specified for statistics should overlap with your filtering criteria, e.g. partitioning columns or high-cardinality columns. Statistics on unused or rarely filtered columns can waste compute resources without significant benefit.&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 18 Jun 2025 16:16:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-skipping-partitioned-tables/m-p/122149#M46675</guid>
      <dc:creator>paolajara</dc:creator>
      <dc:date>2025-06-18T16:16:07Z</dc:date>
    </item>
  </channel>
</rss>

