<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: check statistics of clustering columns per file to see how liquid clustering works in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/149692#M53157</link>
    <description>&lt;P&gt;Good Day&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/147736"&gt;@smoortema&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Delta stores per-file statistics (min, max, null count, row count) for columns enabled for data skipping. You can read those directly from the transaction log.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Step 1: Get the table location&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;DESCRIBE DETAIL catalog.schema.my_table;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Copy the &lt;CODE&gt;location&lt;/CODE&gt; value — you'll need it in the next query.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Step 2: Query file-level stats for your clustering columns&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;WITH raw_log AS (
  SELECT
    add.path  AS file_path,
    add.stats AS stats_json
  FROM json.`s3://bucket/path/my_table/_delta_log/*.json`
  WHERE add IS NOT NULL
),
parsed AS (
  SELECT
    file_path,
    from_json(
      stats_json,
      'numRecords LONG,
       minValues MAP&amp;lt;STRING,STRING&amp;gt;,
       maxValues MAP&amp;lt;STRING,STRING&amp;gt;,
       nullCount MAP&amp;lt;STRING,LONG&amp;gt;'
    ) AS s
  FROM raw_log
)
SELECT
  file_path,
  s.numRecords                          AS rows_in_file,
  CAST(s.minValues['c1'] AS &amp;lt;type&amp;gt;)    AS min_c1,
  CAST(s.maxValues['c1'] AS &amp;lt;type&amp;gt;)    AS max_c1,
  CAST(s.minValues['c2'] AS &amp;lt;type&amp;gt;)    AS min_c2,
  CAST(s.maxValues['c2'] AS &amp;lt;type&amp;gt;)    AS max_c2,
  CAST(s.minValues['c3'] AS &amp;lt;type&amp;gt;)    AS min_c3,
  CAST(s.maxValues['c3'] AS &amp;lt;type&amp;gt;)    AS max_c3
FROM parsed;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Replace &lt;CODE&gt;s3://bucket/path/my_table&lt;/CODE&gt; with your actual location, and &lt;CODE&gt;&amp;lt;type&amp;gt;&lt;/CODE&gt; with the real SQL types for each column (e.g. &lt;CODE&gt;INT&lt;/CODE&gt;, &lt;CODE&gt;DATE&lt;/CODE&gt;, &lt;CODE&gt;TIMESTAMP&lt;/CODE&gt;).&lt;/P&gt;
&lt;P&gt;You'll get one row per &lt;CODE&gt;add&lt;/CODE&gt; action in the log, i.e. one per data file written, showing the min/max range for each clustering column. Keep in mind that the log also contains files that were later rewritten or removed (for example by &lt;CODE&gt;OPTIMIZE&lt;/CODE&gt;), so on an active table some rows may refer to files that are no longer in the current snapshot.&lt;/P&gt;
&lt;P&gt;Worth flagging: stats only exist for columns configured for data skipping. By default that is the first 32 columns (controlled by &lt;CODE&gt;delta.dataSkippingNumIndexedCols&lt;/CODE&gt;), unless the columns are set explicitly via &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;. Columns with types like &lt;CODE&gt;VARIANT&lt;/CODE&gt; won't have min/max stats.&lt;/P&gt;
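&lt;P&gt;If stats turn out to be missing for one of your clustering columns, a sketch of the fix (reusing the placeholder names from above) is to list the columns explicitly and then recollect statistics for files already written:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;-- Placeholder names: c1, c2, c3 are your clustering columns
ALTER TABLE catalog.schema.my_table
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'c1,c2,c3');

-- On recent Databricks runtimes, recompute stats for existing files
ANALYZE TABLE catalog.schema.my_table COMPUTE DELTA STATISTICS;
&lt;/CODE&gt;&lt;/PRE&gt;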
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps. Louis.&lt;/P&gt;
    <pubDate>Tue, 03 Mar 2026 12:20:18 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2026-03-03T12:20:18Z</dc:date>
    <item>
      <title>check statistics of clustering columns per file to see how liquid clustering works</title>
      <link>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/149572#M53125</link>
      <description>&lt;P&gt;I have a Delta table on which I set up liquid clustering using three columns. I would like to check file statistics to see how the clustering column values are distributed along the files. How can I write a query that shows min and max values, etc. of the clustering columns for each of the files?&lt;/P&gt;</description>
      <pubDate>Sun, 01 Mar 2026 20:46:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/149572#M53125</guid>
      <dc:creator>smoortema</dc:creator>
      <dc:date>2026-03-01T20:46:50Z</dc:date>
    </item>
    <item>
      <title>Re: check statistics of clustering columns per file to see how liquid clustering works</title>
      <link>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/149692#M53157</link>
      <description>&lt;P&gt;Good Day&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/147736"&gt;@smoortema&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Delta stores per-file statistics (min, max, null count, row count) for columns enabled for data skipping. You can read those directly from the transaction log.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Step 1: Get the table location&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;DESCRIBE DETAIL catalog.schema.my_table;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Copy the &lt;CODE&gt;location&lt;/CODE&gt; value — you'll need it in the next query.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Step 2: Query file-level stats for your clustering columns&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;WITH raw_log AS (
  SELECT
    add.path  AS file_path,
    add.stats AS stats_json
  FROM json.`s3://bucket/path/my_table/_delta_log/*.json`
  WHERE add IS NOT NULL
),
parsed AS (
  SELECT
    file_path,
    from_json(
      stats_json,
      'numRecords LONG,
       minValues MAP&amp;lt;STRING,STRING&amp;gt;,
       maxValues MAP&amp;lt;STRING,STRING&amp;gt;,
       nullCount MAP&amp;lt;STRING,LONG&amp;gt;'
    ) AS s
  FROM raw_log
)
SELECT
  file_path,
  s.numRecords                          AS rows_in_file,
  CAST(s.minValues['c1'] AS &amp;lt;type&amp;gt;)    AS min_c1,
  CAST(s.maxValues['c1'] AS &amp;lt;type&amp;gt;)    AS max_c1,
  CAST(s.minValues['c2'] AS &amp;lt;type&amp;gt;)    AS min_c2,
  CAST(s.maxValues['c2'] AS &amp;lt;type&amp;gt;)    AS max_c2,
  CAST(s.minValues['c3'] AS &amp;lt;type&amp;gt;)    AS min_c3,
  CAST(s.maxValues['c3'] AS &amp;lt;type&amp;gt;)    AS max_c3
FROM parsed;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Replace &lt;CODE&gt;s3://bucket/path/my_table&lt;/CODE&gt; with your actual location, and &lt;CODE&gt;&amp;lt;type&amp;gt;&lt;/CODE&gt; with the real SQL types for each column (e.g. &lt;CODE&gt;INT&lt;/CODE&gt;, &lt;CODE&gt;DATE&lt;/CODE&gt;, &lt;CODE&gt;TIMESTAMP&lt;/CODE&gt;).&lt;/P&gt;
&lt;P&gt;You'll get one row per &lt;CODE&gt;add&lt;/CODE&gt; action in the log, i.e. one per data file written, showing the min/max range for each clustering column. Keep in mind that the log also contains files that were later rewritten or removed (for example by &lt;CODE&gt;OPTIMIZE&lt;/CODE&gt;), so on an active table some rows may refer to files that are no longer in the current snapshot.&lt;/P&gt;
&lt;P&gt;Worth flagging: stats only exist for columns configured for data skipping. By default that is the first 32 columns (controlled by &lt;CODE&gt;delta.dataSkippingNumIndexedCols&lt;/CODE&gt;), unless the columns are set explicitly via &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;. Columns with types like &lt;CODE&gt;VARIANT&lt;/CODE&gt; won't have min/max stats.&lt;/P&gt;
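&lt;P&gt;If stats turn out to be missing for one of your clustering columns, a sketch of the fix (reusing the placeholder names from above) is to list the columns explicitly and then recollect statistics for files already written:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;-- Placeholder names: c1, c2, c3 are your clustering columns
ALTER TABLE catalog.schema.my_table
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'c1,c2,c3');

-- On recent Databricks runtimes, recompute stats for existing files
ANALYZE TABLE catalog.schema.my_table COMPUTE DELTA STATISTICS;
&lt;/CODE&gt;&lt;/PRE&gt;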
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps. Louis.&lt;/P&gt;
      <pubDate>Tue, 03 Mar 2026 12:20:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/149692#M53157</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-03-03T12:20:18Z</dc:date>
    </item>
    <item>
      <title>Hi @smoortema, There are several approaches for inspectin...</title>
      <link>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/150344#M53379</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/147736"&gt;@smoortema&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;There are several approaches for inspecting per-file column statistics on a liquid-clustered Delta table. Here is a walkthrough from simplest to most detailed.&lt;/P&gt;
&lt;P&gt;APPROACH 1: CONFIRM CLUSTERING CONFIGURATION&lt;/P&gt;
&lt;P&gt;First, verify that clustering is set up correctly:&lt;/P&gt;
&lt;PRE&gt;DESCRIBE DETAIL your_catalog.your_schema.your_table;&lt;/PRE&gt;
&lt;P&gt;This returns a row with a clusteringColumns field showing the columns used for liquid clustering. You can also check properties:&lt;/P&gt;
&lt;PRE&gt;SHOW TBLPROPERTIES your_catalog.your_schema.your_table;&lt;/PRE&gt;
&lt;P&gt;Look for clusteringColumns and (if automatic clustering is enabled) clusterByAuto.&lt;/P&gt;
&lt;P&gt;APPROACH 2: READ PER-FILE COLUMN STATISTICS FROM THE TRANSACTION LOG&lt;/P&gt;
&lt;P&gt;Delta Lake stores min/max statistics for each data file in the _delta_log directory as JSON. You can parse this directly in a notebook to see per-file statistics for your clustering columns.&lt;/P&gt;
&lt;P&gt;In Python:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.functions import col

# Point to the Delta log directory
table_path = spark.sql(
  "DESCRIBE DETAIL your_catalog.your_schema.your_table"
).select("location").collect()[0][0]

log_path = f"{table_path}/_delta_log"

# Read all JSON commit files (add actions contain file-level stats)
log_df = (
  spark.read.json(f"{log_path}/*.json")
  .filter(col("add").isNotNull())
  .select(
      col("add.path").alias("file_path"),
      col("add.size").alias("size_bytes"),
      col("add.stats").alias("stats_json"),
      col("add.clusteringProvider").alias("clustering_provider")
  )
)

log_df.show(truncate=False)&lt;/PRE&gt;
&lt;P&gt;The stats_json column contains a JSON string with numRecords, minValues, and maxValues for your indexed columns. To parse it into usable columns:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.functions import get_json_object

stats_df = log_df.select(
  "file_path",
  "size_bytes",
  get_json_object("stats_json", "$.numRecords").alias("num_records"),
  get_json_object("stats_json", "$.minValues.col_a").alias("col_a_min"),
  get_json_object("stats_json", "$.maxValues.col_a").alias("col_a_max"),
  get_json_object("stats_json", "$.minValues.col_b").alias("col_b_min"),
  get_json_object("stats_json", "$.maxValues.col_b").alias("col_b_max"),
  get_json_object("stats_json", "$.minValues.col_c").alias("col_c_min"),
  get_json_object("stats_json", "$.maxValues.col_c").alias("col_c_max")
)

stats_df.orderBy("col_a_min", "col_b_min", "col_c_min").show(100, truncate=False)&lt;/PRE&gt;
&lt;P&gt;Replace col_a, col_b, col_c with your actual clustering column names. Note that get_json_object returns strings, so cast the min/max values to their real column types where you need numerically correct ordering.&lt;/P&gt;
&lt;P&gt;This gives you a row per file with the min and max value for each clustering column, which is exactly what you need to see how tightly the data is clustered. After running OPTIMIZE, you should see the min/max ranges per file become narrower and more distinct (less overlap between files), which is how liquid clustering improves data skipping.&lt;/P&gt;
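&lt;P&gt;To make the overlap idea concrete, here is a rough sketch (assuming the stats_df from above and, hypothetically, that col_a is an integer column) that counts, for each file, how many other files' col_a ranges intersect it:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.functions import col

# Self-join the per-file stats: two [min, max] ranges intersect
# exactly when each one starts before the other ends
a = stats_df.selectExpr("file_path AS f1",
                        "CAST(col_a_min AS INT) AS min1",
                        "CAST(col_a_max AS INT) AS max1")
b = stats_df.selectExpr("file_path AS f2",
                        "CAST(col_a_min AS INT) AS min2",
                        "CAST(col_a_max AS INT) AS max2")

overlaps = (
    a.join(b, (col("f1") != col("f2"))
              &amp;amp; (col("min1") &amp;lt;= col("max2"))
              &amp;amp; (col("min2") &amp;lt;= col("max1")))
     .groupBy("f1").count()
)

# Files with high counts sit in wide or overlapping ranges; after
# OPTIMIZE these counts should drop
overlaps.orderBy(col("count").desc()).show(20, truncate=False)&lt;/PRE&gt;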
&lt;P&gt;APPROACH 3: USE THE DELTALOG PYTHON API&lt;/P&gt;
&lt;P&gt;You can also use the DeltaTable API to read the log more cleanly:&lt;/P&gt;
&lt;PRE&gt;from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "your_catalog.your_schema.your_table")

# Get the detail (file count, size, clustering columns)
dt.detail().show(truncate=False)

# Get history to see OPTIMIZE operations and their metrics
dt.history().filter("operation = 'OPTIMIZE'").select(
  "version", "timestamp", "operationMetrics"
).show(truncate=False)&lt;/PRE&gt;
&lt;P&gt;The operationMetrics map for OPTIMIZE operations includes numFilesRemoved, numFilesAdded, and file size distribution percentiles (minFileSize, p25FileSize, p50FileSize, p75FileSize, maxFileSize), which help you gauge how well the compaction and reclustering went.&lt;/P&gt;
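&lt;P&gt;For example, a small sketch that flattens those metrics into one row per metric, reusing the dt handle from above:&lt;/P&gt;
&lt;PRE&gt;from pyspark.sql.functions import explode

# operationMetrics is a map&amp;lt;string,string&amp;gt;; explode turns it
# into key/value rows so individual metrics are easy to scan
(dt.history()
   .filter("operation = 'OPTIMIZE'")
   .select("version", "timestamp", explode("operationMetrics"))
   .show(100, truncate=False))&lt;/PRE&gt;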
&lt;P&gt;APPROACH 4: CHECKPOINT-BASED INSPECTION FOR LARGE TABLES&lt;/P&gt;
&lt;P&gt;For larger tables where reading all JSON log files is slow, Delta periodically writes Parquet checkpoint files (every 10 commits by default). You can read the latest checkpoint directly:&lt;/P&gt;
&lt;PRE&gt;# Find the latest checkpoint version; _last_checkpoint is a
# one-line JSON pointer file, not Parquet
checkpoint_df = spark.read.json(f"{log_path}/_last_checkpoint")
checkpoint_version = checkpoint_df.select("version").collect()[0][0]

# Read the checkpoint Parquet file. This assumes a classic single-part
# checkpoint; if _last_checkpoint contains a "parts" field, the
# checkpoint is split across several files with a different name pattern.
cp_df = spark.read.parquet(
  f"{log_path}/{str(checkpoint_version).zfill(20)}.checkpoint.parquet"
)

# Filter for active add actions and extract stats
cp_df.filter(col("add").isNotNull()).select(
  col("add.path").alias("file_path"),
  col("add.size").alias("size_bytes"),
  col("add.stats_parsed.numRecords").alias("num_records"),
  col("add.stats_parsed.minValues.*"),
  col("add.stats_parsed.maxValues.*")
).show(truncate=False)&lt;/PRE&gt;
&lt;P&gt;Checkpoint files have the stats already parsed into a struct (stats_parsed) rather than a JSON string, which makes column extraction more straightforward. This relies on delta.checkpoint.writeStatsAsStruct being enabled, the default on recent Databricks runtimes; if stats_parsed is missing, fall back to parsing the stats JSON string as in Approach 2.&lt;/P&gt;
&lt;P&gt;INTERPRETING THE RESULTS&lt;/P&gt;
&lt;P&gt;When liquid clustering is working well, you will see:&lt;BR /&gt;
- Narrow min/max ranges per file for each clustering column (tight clustering)&lt;BR /&gt;
- Minimal overlap between files on the clustering column ranges&lt;BR /&gt;
- After OPTIMIZE, files that previously had wide or overlapping ranges get rewritten into tighter, non-overlapping segments&lt;/P&gt;
&lt;P&gt;If you see wide, overlapping ranges across many files, it typically means OPTIMIZE has not yet been run (or not run recently enough). Running OPTIMIZE triggers incremental reclustering. On Databricks Runtime 16.0+, you can also run OPTIMIZE table_name FULL to force a complete reclustering when first enabling clustering or after changing clustering keys.&lt;/P&gt;
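&lt;P&gt;For example:&lt;/P&gt;
&lt;PRE&gt;-- Incremental reclustering of new and changed data
OPTIMIZE your_catalog.your_schema.your_table;

-- DBR 16.0+: force a complete rewrite, e.g. after changing clustering keys
OPTIMIZE your_catalog.your_schema.your_table FULL;&lt;/PRE&gt;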
&lt;P&gt;For reference:&lt;BR /&gt;
&lt;A href="https://docs.databricks.com/en/delta/clustering.html" target="_blank"&gt;https://docs.databricks.com/en/delta/clustering.html&lt;/A&gt;&lt;BR /&gt;
&lt;A href="https://docs.databricks.com/en/delta/data-skipping.html" target="_blank"&gt;https://docs.databricks.com/en/delta/data-skipping.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 05:53:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/check-statistics-of-clustering-columns-per-file-to-see-how/m-p/150344#M53379</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-09T05:53:07Z</dc:date>
    </item>
  </channel>
</rss>

