<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best Practice for Updating Data Skipping Statistics for Additional Columns in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139281#M51136</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;, thank you for your response.&amp;nbsp;&lt;/P&gt;&lt;P&gt;If I update the table property to include all 42 columns (new 10 + old 32 cols) in delta.dataSkippingStatsColumns,&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pooja_bhumandla_0-1763366089648.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21745i668FC82FEFDCC0C4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="pooja_bhumandla_0-1763366089648.png" alt="pooja_bhumandla_0-1763366089648.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;and then run ANALYZE TABLE only for the new 10 columns (col33–col42):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pooja_bhumandla_1-1763366216377.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21747i99B15D30C86F2A5C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="pooja_bhumandla_1-1763366216377.png" alt="pooja_bhumandla_1-1763366216377.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;My Questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Will this keep the existing statistics for columns 1–32 unchanged and only compute statistics for columns 33–42?&lt;/LI&gt;&lt;LI&gt;Are the statistics for both the old (1–32) and new (33–42) columns maintained together in the same centralized metadata location, or are they managed separately?&lt;/LI&gt;&lt;LI&gt;Does this approach ensure that all 42 columns will have statistics available for data skipping without any redundant recomputation? 
(i.e., existing stats stay as-is, and only new stats are added)&lt;/LI&gt;&lt;LI&gt;After setting the dataSkipping property for all 42 columns, will future data loads automatically generate and maintain statistics for all 42 columns going forward?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Appreciate any insights and clarifications!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 17 Nov 2025 08:09:02 GMT</pubDate>
    <dc:creator>pooja_bhumandla</dc:creator>
    <dc:date>2025-11-17T08:09:02Z</dc:date>
    <item>
      <title>Best Practice for Updating Data Skipping Statistics for Additional Columns</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139057#M51082</link>
      <description>&lt;P&gt;Hi Community,&lt;BR /&gt;I have a scenario where I’ve already calculated Delta statistics for the first 32 columns after enabling the data skipping property. Now, I need to include 10 more frequently used columns that were not part of the original 32.&lt;/P&gt;&lt;P&gt;Goal:&lt;BR /&gt;I want a robust way to calculate statistics for the 10 new columns along with the existing 32 columns without unnecessary recomputation or fragmented maintenance.&lt;/P&gt;&lt;P&gt;Here’s the challenge:&lt;BR /&gt;If I alter the data skipping property to cover all 42 columns (32 existing + 10 new) and recompute statistics for all tables, it introduces unnecessary recomputation.&lt;/P&gt;&lt;P&gt;What’s the recommended approach or best practice here?&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Appreciate any insights!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Nov 2025 10:53:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139057#M51082</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-11-14T10:53:43Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practice for Updating Data Skipping Statistics for Additional Columns</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139096#M51094</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170125"&gt;@pooja_bhumandla&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Updating either of the two properties below does not automatically recompute statistics for existing data. Rather, it changes how statistics are collected in the future, when data is added to or updated in the table.&lt;/P&gt;&lt;P&gt;- &lt;STRONG&gt;delta.dataSkippingNumIndexedCols&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;- &lt;STRONG&gt;delta.dataSkippingStatsColumns&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In Databricks Runtime 14.3 LTS and above, if you have altered the table properties or changed the specified columns for statistics, you can manually trigger the recomputation of statistics for a Delta table using the following command:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;ANALYZE TABLE table_name COMPUTE DELTA STATISTICS&lt;/LI-CODE&gt;&lt;P&gt;But you don't have to recompute statistics for all columns. This command supports listing the columns for which you'd like to refresh stats.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1763129695038.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21710iA8DAB4D28A838CFB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1763129695038.png" alt="szymon_dybczak_0-1763129695038.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;So, either raise&amp;nbsp;&lt;SPAN&gt;delta.dataSkippingNumIndexedCols to override the 32-column limit and then run ANALYZE TABLE manually, or use delta.dataSkippingStatsColumns, list all the required columns, and run ANALYZE TABLE manually.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Nov 2025 14:17:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139096#M51094</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-11-14T14:17:40Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practice for Updating Data Skipping Statistics for Additional Columns</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139281#M51136</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;, thank you for your response.&amp;nbsp;&lt;/P&gt;&lt;P&gt;If I update the table property to include all 42 columns (new 10 + old 32 cols) in delta.dataSkippingStatsColumns,&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pooja_bhumandla_0-1763366089648.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21745i668FC82FEFDCC0C4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="pooja_bhumandla_0-1763366089648.png" alt="pooja_bhumandla_0-1763366089648.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;and then run ANALYZE TABLE only for the new 10 columns (col33–col42):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="pooja_bhumandla_1-1763366216377.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21747i99B15D30C86F2A5C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="pooja_bhumandla_1-1763366216377.png" alt="pooja_bhumandla_1-1763366216377.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;My Questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Will this keep the existing statistics for columns 1–32 unchanged and only compute statistics for columns 33–42?&lt;/LI&gt;&lt;LI&gt;Are the statistics for both the old (1–32) and new (33–42) columns maintained together in the same centralized metadata location, or are they managed separately?&lt;/LI&gt;&lt;LI&gt;Does this approach ensure that all 42 columns will have statistics available for data skipping without any redundant recomputation? 
(i.e., existing stats stay as-is, and only new stats are added)&lt;/LI&gt;&lt;LI&gt;After setting the dataSkipping property for all 42 columns, will future data loads automatically generate and maintain statistics for all 42 columns going forward?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Appreciate any insights and clarifications!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 17 Nov 2025 08:09:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/139281#M51136</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-11-17T08:09:02Z</dc:date>
    </item>
    <item>
      <title>Re: Best Practice for Updating Data Skipping Statistics for Additional Columns</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/141452#M51729</link>
      <description>&lt;P class="p8i6j01 paragraph"&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170125"&gt;@pooja_bhumandla&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;To answer your second set of questions directly:&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;1) &lt;STRONG&gt;Will running ANALYZE for only col33–col42 leave cols 1–32 unchanged?&lt;/STRONG&gt;&lt;BR /&gt;Yes. With the table property set to include all 42 columns, you can run a targeted recomputation just for the new 10 columns:&lt;/P&gt;
&lt;DIV class="l8rrz21 _1ibi0s3cl" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql p8i6j0e hljs language-sql _12n1b832"&gt;   ANALYZE &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; your_catalog.your_schema.your_table
   COMPUTE DELTA STATISTICS
   &lt;SPAN class="hljs-keyword"&gt;FOR&lt;/SPAN&gt; COLUMNS col33, col34, col35, col36, col37, col38, col39, col40, col41, col42;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="l8rrz23 _1ibi0s32y _1ibi0s3cm _1ibi0s3ay _1ibi0s3bo"&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;&lt;SPAN&gt;This recomputes Delta file-skipping stats only for the specified columns. Existing stats on cols 1–32 are left as-is.&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;&lt;SPAN&gt;2) &lt;STRONG&gt;Are stats for old and new columns kept together centrally or separately?&lt;/STRONG&gt;&lt;BR /&gt;The&lt;/SPAN&gt;y’re maintained together in the Delta log at the file level (one set of file-level stats per file, including the configured columns). There isn’t a separate store per column. The stats schema in the log enumerates which columns have statistics and stores their minimum, maximum, and null counts for each file.&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;3) &lt;STRONG&gt;Does this avoid redundant recomputation?&lt;/STRONG&gt;&lt;BR /&gt;Yes. Because you target only col33–col42 with FOR COLUMNS, only those columns are recomputed. Existing stats for cols 1–32 remain untouched, so you avoid recomputing work already done.&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="l8rrz25 _1ibi0s3cj"&gt;4) &lt;STRONG&gt;Will future data loads automatically maintain stats for all 42 columns?&lt;/STRONG&gt;&lt;BR /&gt;Yes. After setting delta.dataSkippingStatsColumns (or raising delta.dataSkippingNumIndexedCols), future writes automatically collect file-skipping stats for the configured columns.&amp;nbsp;Property changes affect future stats collection and don’t retroactively recompute for existing files.&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Mon, 08 Dec 2025 19:38:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-updating-data-skipping-statistics-for/m-p/141452#M51729</guid>
      <dc:creator>stbjelcevic</dc:creator>
      <dc:date>2025-12-08T19:38:41Z</dc:date>
    </item>
  </channel>
</rss>

