<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Should I enable Liquid Clustering based on table size distribution? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/should-i-enable-liquid-clustering-based-on-table-size/m-p/138937#M51055</link>
    <description>&lt;P class="qt3gz91 paragraph"&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170125"&gt;@pooja_bhumandla&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What Liquid Clustering is best for&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;It is particularly beneficial for tables that are frequently filtered by high‑cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For very small tables, &lt;STRONG&gt;Automatic Liquid Clustering&lt;/STRONG&gt; may choose not to select keys because the table is too small to benefit; LC’s gains are workload-dependent and cost-aware.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Assessment for your environment&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Large (&amp;gt;1 TB) — 3 tables:&lt;/STRONG&gt; Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization). Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Medium (10 GB – 1 TB) — 284 tables:&lt;/STRONG&gt; High ROI segment. LC generally improves file pruning and read latencies; recommended for new and existing Delta tables. Avoid over‑specifying clustering keys on smaller side of this range (see key count guidance below).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Small (&amp;lt; 10 GB) — 17,266 tables:&lt;/STRONG&gt; Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;If these small tables are frequently filtered on high‑cardinality columns, LC can still help with data skipping, but Automatic LC may decline to pick keys for very small tables; weigh OPTIMIZE costs against query savings.&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
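&lt;P class="qt3gz91 paragraph"&gt;To make the large‑table rollout concrete, here is a minimal sketch (table and column names below are placeholders, not from your environment):&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Hypothetical table/columns; pick your most common filter predicates
ALTER TABLE sales_large CLUSTER BY (event_date, customer_id);
OPTIMIZE sales_large FULL;   -- one-time recluster of existing data
OPTIMIZE sales_large;        -- ongoing incremental maintenance&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;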
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Expected impact&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Customers have observed read performance improvements of &lt;STRONG&gt;2–12x&lt;/STRONG&gt; versus traditional partitioning/ZORDER methods, depending on workload and key selection.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Best practices and rollout plan&lt;/H3&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;1) Choose the right clustering keys&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Start with 1–2 keys most frequently used in filters, especially high‑cardinality columns (e.g., timestamp, user_id). On smaller tables (under ~10 TB), too many keys can hurt single‑column filter performance; cap at two unless multi‑column filters dominate.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you previously partitioned or Z‑ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
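&lt;P class="qt3gz91 paragraph"&gt;For example (hypothetical table; event_ts and user_id stand in for your most‑filtered high‑cardinality columns), a new table clustered on two keys:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Illustrative schema; start with the 1-2 columns most used in filters
CREATE TABLE events (
  event_ts TIMESTAMP,
  user_id  BIGINT,
  payload  STRING
) CLUSTER BY (event_ts, user_id);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;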
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;2) Let Databricks pick keys where possible&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;For Unity Catalog managed tables, enable &lt;STRONG&gt;CLUSTER BY AUTO&lt;/STRONG&gt; (requires Predictive Optimization) so Databricks analyzes query history and selects/adapts keys automatically when the predicted savings outweigh clustering costs.
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql _1ymogdh2"&gt;&lt;SPAN class="hljs-comment"&gt;-- New table&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;CREATE TABLE&lt;/SPAN&gt; my_uc_tbl (...) CLUSTER &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; AUTO;

&lt;SPAN class="hljs-comment"&gt;-- Existing table&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;ALTER TABLE&lt;/SPAN&gt; my_uc_tbl CLUSTER &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; AUTO;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="_17yk06p0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;3) Trigger clustering and keep it incremental&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Use &lt;CODE class="qt3gz9f"&gt;OPTIMIZE&lt;/CODE&gt; regularly; it incrementally reclusters only what’s needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on highly updated tables; with PO, disable manual schedules and let PO run it.
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql _1ymogdh2"&gt;OPTIMIZE my_tbl;            &lt;SPAN class="hljs-comment"&gt;-- incremental reclustering&lt;/SPAN&gt;
OPTIMIZE my_tbl &lt;SPAN class="hljs-keyword"&gt;FULL&lt;/SPAN&gt;;       &lt;SPAN class="hljs-comment"&gt;-- force reclustering of all records (on first enablement or after a key change)&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;For large tables, run &lt;CODE class="qt3gz9f"&gt;OPTIMIZE FULL&lt;/CODE&gt; once when enabling LC or changing keys; this can take hours but sets a clean baseline.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;4) Understand clustering-on-write thresholds&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Clustering on write triggers only when an individual transaction meets size thresholds; if your inserts are small, rely on periodic OPTIMIZE to finish clustering. Thresholds (UC managed tables): ~64 MB for 1 key, 256 MB for 2 keys, 512 MB for 3 keys, 1 GB for 4 keys.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;5) Streaming and ingestion considerations&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Structured Streaming can cluster on write by enabling &lt;CODE class="qt3gz9f"&gt;spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true&lt;/CODE&gt;, but thresholds apply and many streaming micro-batches won’t trigger; OPTIMIZE catches up incrementally.&lt;/LI&gt;
&lt;/UL&gt;
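&lt;P class="qt3gz91 paragraph"&gt;If you want streaming writers to attempt clustering on write, the flag can be set at the session level; a sketch (verify the property against your DBR version):&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Session-level setting; small micro-batches may still fall below thresholds
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;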
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;6) Compatibility and runtime&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;LC requires upgraded Delta protocols (writer v7, reader v3) and deletion vectors; you cannot downgrade table protocol versions. LC is GA for Delta Lake in DBR 15.2+, and Databricks recommends recent runtimes for best performance (e.g., DBR 17.2+ improves OPTIMIZE performance on large tables).&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;7) Alternatives for small tables&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Prefer &lt;STRONG&gt;broadcast joins&lt;/STRONG&gt;, keep &lt;STRONG&gt;table statistics&lt;/STRONG&gt; up to date (&lt;CODE class="qt3gz9f"&gt;ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS&lt;/CODE&gt;), and use &lt;STRONG&gt;Delta caching&lt;/STRONG&gt; for repeated access; these tend to deliver more benefit than LC for tiny lookup/dimension tables.&lt;/LI&gt;
&lt;/UL&gt;
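&lt;P class="qt3gz91 paragraph"&gt;For the small‑table path, keeping statistics fresh is often the bigger lever; for a hypothetical dimension table:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- dim_customer is a placeholder name
ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR ALL COLUMNS;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;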
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;&lt;span class="lia-unicode-emoji" title=":smiling_face_with_sunglasses:"&gt;😎&lt;/span&gt; General table layout guidance&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Avoid partitioning tables under ~1 TB; LC (or ingestion‑time clustering + auto‑optimize) is a better choice to prevent small‑file and skew issues while retaining data skipping.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;Predictive Optimization&lt;/STRONG&gt; on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Recommended approach for your mix&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For the 17k+ small tables, only enable LC where the workload is filter‑heavy on high‑cardinality columns; otherwise rely on broadcast/stats/caching for best overall cost‑performance and leave LC to PO where it decides it’s worthwhile.&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Thu, 13 Nov 2025 14:25:37 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2025-11-13T14:25:37Z</dc:date>
    <item>
      <title>Should I enable Liquid Clustering based on table size distribution?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-enable-liquid-clustering-based-on-table-size/m-p/138919#M51046</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m evaluating whether Liquid Clustering would be beneficial for the tables based on the sizes. Below is the size distribution of tables in my environment:&lt;/P&gt;&lt;DIV class=""&gt;Size Bucket Table Count &lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Large (&amp;gt; 1 TB)&lt;/TD&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Medium (10 GB – 1 TB)&lt;/TD&gt;&lt;TD&gt;284&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;Small (&amp;lt; 10 GB)&lt;/TD&gt;&lt;TD&gt;17,266&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/DIV&gt;&lt;P&gt;Given this distribution, would Liquid Clustering provide significant performance improvements, or is it better suited for environments with more large tables?&lt;/P&gt;&lt;P&gt;Any best practices or considerations would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Nov 2025 12:42:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-enable-liquid-clustering-based-on-table-size/m-p/138919#M51046</guid>
      <dc:creator>pooja_bhumandla</dc:creator>
      <dc:date>2025-11-13T12:42:31Z</dc:date>
    </item>
    <item>
      <title>Re: Should I enable Liquid Clustering based on table size distribution?</title>
      <link>https://community.databricks.com/t5/data-engineering/should-i-enable-liquid-clustering-based-on-table-size/m-p/138937#M51055</link>
      <description>&lt;P class="qt3gz91 paragraph"&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170125"&gt;@pooja_bhumandla&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;Based on your size distribution, enabling Liquid Clustering can provide meaningful gains—but you’ll get the highest ROI by prioritizing your medium and large tables first and selectively applying it to small tables where query patterns warrant it.&lt;/P&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;What Liquid Clustering is best for&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt; replaces partitioning and ZORDER and is recommended for new Delta tables to simplify layout decisions and improve data skipping and file pruning for filtered queries.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;It is particularly beneficial for tables that are frequently filtered by high‑cardinality columns, have skewed distributions, grow quickly, see concurrent writes, or have evolving access patterns.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For very small tables, &lt;STRONG&gt;Automatic Liquid Clustering&lt;/STRONG&gt; may choose not to select keys because the table is too small to benefit; LC’s gains are workload-dependent and cost-aware.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Assessment for your environment&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Large (&amp;gt;1 TB) — 3 tables:&lt;/STRONG&gt; Strong candidates. Enable LC and cluster by the most common filter columns; run an initial OPTIMIZE FULL to recluster existing data and schedule ongoing OPTIMIZE (or enable Predictive Optimization). Expect noticeable read speedups and simpler maintenance compared to partitioning/ZORDER.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Medium (10 GB – 1 TB) — 284 tables:&lt;/STRONG&gt; High ROI segment. LC generally improves file pruning and read latencies; recommended for new and existing Delta tables. Avoid over‑specifying clustering keys on smaller side of this range (see key count guidance below).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;&lt;STRONG&gt;Small (&amp;lt; 10 GB) — 17,266 tables:&lt;/STRONG&gt; Apply LC selectively. If the tables are mostly broadcast joined, scanned in full, or used as small dimensions, LC offers limited benefit versus simply relying on broadcast joins, table stats, and caching.&lt;/P&gt;
&lt;P class="qt3gz91 paragraph"&gt;If these small tables are frequently filtered on high‑cardinality columns, LC can still help with data skipping, but Automatic LC may decline to pick keys for very small tables; weigh OPTIMIZE costs against query savings.&lt;/P&gt;&lt;/LI&gt;
&lt;/UL&gt;
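&lt;P class="qt3gz91 paragraph"&gt;To make the large‑table rollout concrete, here is a minimal sketch (table and column names below are placeholders, not from your environment):&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Hypothetical table/columns; pick your most common filter predicates
ALTER TABLE sales_large CLUSTER BY (event_date, customer_id);
OPTIMIZE sales_large FULL;   -- one-time recluster of existing data
OPTIMIZE sales_large;        -- ongoing incremental maintenance&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;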
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Expected impact&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Customers have observed read performance improvements of &lt;STRONG&gt;2–12x&lt;/STRONG&gt; versus traditional partitioning/ZORDER methods, depending on workload and key selection.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Best practices and rollout plan&lt;/H3&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;1) Choose the right clustering keys&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Start with 1–2 keys most frequently used in filters, especially high‑cardinality columns (e.g., timestamp, user_id). On smaller tables (under ~10 TB), too many keys can hurt single‑column filter performance; cap at two unless multi‑column filters dominate.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;If you previously partitioned or Z‑ordered, reuse those columns as LC keys (partition columns, ZORDER BY columns).&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
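&lt;P class="qt3gz91 paragraph"&gt;For example (hypothetical table; event_ts and user_id stand in for your most‑filtered high‑cardinality columns), a new table clustered on two keys:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Illustrative schema; start with the 1-2 columns most used in filters
CREATE TABLE events (
  event_ts TIMESTAMP,
  user_id  BIGINT,
  payload  STRING
) CLUSTER BY (event_ts, user_id);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;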
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;2) Let Databricks pick keys where possible&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;For Unity Catalog managed tables, enable &lt;STRONG&gt;CLUSTER BY AUTO&lt;/STRONG&gt; (requires Predictive Optimization) so Databricks analyzes query history and selects/adapts keys automatically when the predicted savings outweigh clustering costs.
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql _1ymogdh2"&gt;&lt;SPAN class="hljs-comment"&gt;-- New table&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;CREATE TABLE&lt;/SPAN&gt; my_uc_tbl (...) CLUSTER &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; AUTO;

&lt;SPAN class="hljs-comment"&gt;-- Existing table&lt;/SPAN&gt;
&lt;SPAN class="hljs-keyword"&gt;ALTER TABLE&lt;/SPAN&gt; my_uc_tbl CLUSTER &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; AUTO;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="go8b9g3 _7pq7t62w _7pq7t6ck _7pq7t6aw _7pq7t6bm"&gt;
&lt;DIV class="_17yk06p0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;3) Trigger clustering and keep it incremental&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Use &lt;CODE class="qt3gz9f"&gt;OPTIMIZE&lt;/CODE&gt; regularly; it incrementally reclusters only what’s needed. Without Predictive Optimization, schedule OPTIMIZE hourly or every few hours on highly updated tables; with PO, disable manual schedules and let PO run it.
&lt;DIV class="_7pq7t614 _7pq7t6cj wrz27r2 wrz27r0"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql _1ymogdh2"&gt;OPTIMIZE my_tbl;            &lt;SPAN class="hljs-comment"&gt;-- incremental reclustering&lt;/SPAN&gt;
OPTIMIZE my_tbl &lt;SPAN class="hljs-keyword"&gt;FULL&lt;/SPAN&gt;;       &lt;SPAN class="hljs-comment"&gt;-- force reclustering of all records (on first enablement or after a key change)&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;For large tables, run &lt;CODE class="qt3gz9f"&gt;OPTIMIZE FULL&lt;/CODE&gt; once when enabling LC or changing keys; this can take hours but sets a clean baseline.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;4) Understand clustering-on-write thresholds&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Clustering on write triggers only when an individual transaction meets size thresholds; if your inserts are small, rely on periodic OPTIMIZE to finish clustering. Thresholds (UC managed tables): ~64 MB for 1 key, 256 MB for 2 keys, 512 MB for 3 keys, 1 GB for 4 keys.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;5) Streaming and ingestion considerations&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Structured Streaming can cluster on write by enabling &lt;CODE class="qt3gz9f"&gt;spark.databricks.delta.liquid.eagerClustering.streaming.enabled=true&lt;/CODE&gt;, but thresholds apply and many streaming micro-batches won’t trigger; OPTIMIZE catches up incrementally.&lt;/LI&gt;
&lt;/UL&gt;
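&lt;P class="qt3gz91 paragraph"&gt;If you want streaming writers to attempt clustering on write, the flag can be set at the session level; a sketch (verify the property against your DBR version):&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- Session-level setting; small micro-batches may still fall below thresholds
SET spark.databricks.delta.liquid.eagerClustering.streaming.enabled = true;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;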
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;6) Compatibility and runtime&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;LC requires upgraded Delta protocols (writer v7, reader v3) and deletion vectors; you cannot downgrade table protocol versions. LC is GA for Delta Lake in DBR 15.2+, and Databricks recommends recent runtimes for best performance (e.g., DBR 17.2+ improves OPTIMIZE performance on large tables).&lt;/LI&gt;
&lt;/UL&gt;
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;7) Alternatives for small tables&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;Prefer &lt;STRONG&gt;broadcast joins&lt;/STRONG&gt;, keep &lt;STRONG&gt;table statistics&lt;/STRONG&gt; up to date (&lt;CODE class="qt3gz9f"&gt;ANALYZE TABLE ... COMPUTE STATISTICS FOR ALL COLUMNS&lt;/CODE&gt;), and use &lt;STRONG&gt;Delta caching&lt;/STRONG&gt; for repeated access; these tend to deliver more benefit than LC for tiny lookup/dimension tables.&lt;/LI&gt;
&lt;/UL&gt;
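&lt;P class="qt3gz91 paragraph"&gt;For the small‑table path, keeping statistics fresh is often the bigger lever; for a hypothetical dimension table:&lt;/P&gt;
&lt;DIV class="go8b9g1 _7pq7t6cj" data-ui-element="code-block-container"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-sql qt3gz9e hljs language-sql"&gt;-- dim_customer is a placeholder name
ANALYZE TABLE dim_customer COMPUTE STATISTICS FOR ALL COLUMNS;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;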
&lt;H4 class="_7uu25p0 qt3gz9c _7pq7t612 heading4 _7uu25p1"&gt;&lt;span class="lia-unicode-emoji" title=":smiling_face_with_sunglasses:"&gt;😎&lt;/span&gt; General table layout guidance&lt;/H4&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Avoid partitioning tables under ~1 TB; LC (or ingestion‑time clustering + auto‑optimize) is a better choice to prevent small‑file and skew issues while retaining data skipping.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Use &lt;STRONG&gt;Predictive Optimization&lt;/STRONG&gt; on UC managed tables to automate OPTIMIZE/VACUUM and LC maintenance across your estate.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_7uu25p0 qt3gz9c _7pq7t612 heading3 _7uu25p1"&gt;Recommended approach for your mix&lt;/H3&gt;
&lt;UL class="qt3gz97 qt3gz92"&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;Prioritize enabling LC (or LC AUTO) on the 3 large and 284 medium tables, beginning with those most queried via filters and those with skew/concurrency; measure query latency and scan bytes before/after.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="qt3gz9a"&gt;
&lt;P class="qt3gz91 paragraph"&gt;For the 17k+ small tables, only enable LC where the workload is filter‑heavy on high‑cardinality columns; otherwise rely on broadcast/stats/caching for best overall cost‑performance and leave LC to PO where it decides it’s worthwhile.&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 13 Nov 2025 14:25:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/should-i-enable-liquid-clustering-based-on-table-size/m-p/138937#M51055</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-13T14:25:37Z</dc:date>
    </item>
  </channel>
</rss>

