<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Liquid clustering performance issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121821#M46563</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/169657"&gt;@OODataEng&lt;/a&gt;, could you please elaborate on what exact command you used to create a new table using the older one?&lt;/P&gt;</description>
    <pubDate>Sun, 15 Jun 2025 17:25:52 GMT</pubDate>
    <dc:creator>nikhilj0421</dc:creator>
    <dc:date>2025-06-15T17:25:52Z</dc:date>
    <item>
      <title>Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121812#M46556</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Hello,&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I have a table with approximately 300 million records. It weighs 3.4 GB and consists of 305 files.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I wanted to create liquid clustering for it and chose a date column as the key for clustering.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;When I created a new table with the above details based on the original table, the size of the table increased to nearly 5 GB. Although the number of files indeed decreased to 24, query performance was slower on the new table, and in the performance metrics, I noticed that the amount of data being read was larger compared to the original table without clustering.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;What am I missing here? Why is this happening?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Jun 2025 14:29:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121812#M46556</guid>
      <dc:creator>OODataEng</dc:creator>
      <dc:date>2025-06-15T14:29:35Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121817#M46561</link>
      <description>&lt;DIV&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/169657"&gt;@OODataEng&lt;/a&gt;&amp;nbsp;,&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Great question—and you're definitely not alone in running into this when first applying liquid clustering.&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;It sounds counterintuitive at first: you apply a performance optimization (clustering), but the table size increases and the queries slow down.&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Let me break down what's likely happening and share some insights that might help.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Why Your Table Size Increased (3.4 GB ➝ 5 GB)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;When you enable liquid clustering, the engine rewrites the table into fewer, larger files that are organized by the clustering key (in your case, a date column).&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;While this reduces the number of files (305 ➝ 24), it also introduces some overhead:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;1. Compression efficiency drops: If the original files were compressed better (due to more uniform data chunks), the new clustered layout might actually be less&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;compressible—especially if the date column has a wide range of values per file.&lt;/DIV&gt;&lt;DIV&gt;2. Metadata overhead: Clustering introduces additional metadata to help with file pruning and optimization. That metadata has a cost, especially at scale.&lt;/DIV&gt;&lt;DIV&gt;3. 
Larger files can mean more I/O: If your queries only need small slices of data, larger files mean more unnecessary data gets read.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Why Your Queries Are Slower&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Even though clustering is meant to improve performance, it depends heavily on how you query the data:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Are your queries filtering on the date column?&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;If not, the clustering isn’t helping prune files—and you’re paying the price of larger files without any benefit.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;How selective is the date filter?&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;If you’re scanning large date ranges, the system still needs to read most of the files.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Was your original table accidentally optimized?&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;It’s possible the original 305 small files gave better parallelism or caching benefits.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;What You Can Try Next&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Here are some tips to investigate and improve things:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;1. Check pruning effectiveness&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;Run this to see how much data is being skipped:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;explain select * from your_table where date_col = '2024-01-01';&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;If it's still reading most of the data, clustering isn’t helping much yet.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;2. 
Use `describe detail` or `input_file_name()`&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;These can help you analyze file sizes and distribution after clustering.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;3. Test with Z-Ordering (if applicable)&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;If your queries filter on multiple columns (not just date), consider Z-ordering for multi-dimensional clustering.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;4. Try a smaller clustering target&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;Smaller file sizes (using config like `delta.targetFileSize`) may help restore parallelism.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;5. Review distribution of your clustering column&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp; &amp;nbsp;Is the data evenly spread across dates, or is there a skew (e.g., most records fall in a few days)? Skewed data can lead to unbalanced files, which slows things down.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;In Summary&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Liquid clustering is a powerful tool, but it’s not a silver bullet. Its effectiveness depends heavily on:&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;What column you use for clustering&lt;/DIV&gt;&lt;DIV&gt;How your queries are written&lt;/DIV&gt;&lt;DIV&gt;How your data is distributed&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;You’re on the right track by experimenting with it! I’d recommend measuring query performance again after a few days or with different filters—sometimes clustering&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;benefits kick in once the system has had time to optimize further.&lt;/DIV&gt;</description>
      <pubDate>Sun, 15 Jun 2025 15:31:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121817#M46561</guid>
      <dc:creator>HariSankar</dc:creator>
      <dc:date>2025-06-15T15:31:10Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121821#M46563</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/169657"&gt;@OODataEng&lt;/a&gt;, could you please elaborate on what exact command you used to create a new table using the older one?&lt;/P&gt;</description>
      <pubDate>Sun, 15 Jun 2025 17:25:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121821#M46563</guid>
      <dc:creator>nikhilj0421</dc:creator>
      <dc:date>2025-06-15T17:25:52Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121827#M46565</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;&lt;P&gt;create table x cluster by (col1,col2,col3)&lt;/P&gt;&lt;P&gt;as select * from y&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;and then I run the optimize command on table x&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Sun, 15 Jun 2025 18:00:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121827#M46565</guid>
      <dc:creator>OODataEng</dc:creator>
      <dc:date>2025-06-15T18:00:11Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121828#M46566</link>
      <description>&lt;P&gt;I tried it with 1 column as a clustering key and also with 3 columns; in both scenarios the size of the table increased.&lt;/P&gt;</description>
      <pubDate>Sun, 15 Jun 2025 18:03:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121828#M46566</guid>
      <dc:creator>OODataEng</dc:creator>
      <dc:date>2025-06-15T18:03:19Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121868#M46577</link>
      <description>&lt;P&gt;&lt;SPAN&gt;I really appreciate your detailed response.&lt;BR /&gt;I tried changing the key to a column with higher cardinality—a column that includes both date and time (timestamp)—to achieve a more even distribution.&lt;BR /&gt;Now the size has dropped to 4.3 GiB, and the number of files is 64.&lt;BR /&gt;&lt;BR /&gt;Currently, the queries performed similarly, with a slight advantage for the table with the CLUSTER BY, when filtering on the clustered column. However, the amount of MB read was still higher.&lt;BR /&gt;I was under the impression that, unlike PARTITION BY, I don’t need to find a column with an even distribution for CLUSTER BY.&lt;BR /&gt;&lt;BR /&gt;Could you clarify in which cases I would see a significant improvement with a CLUSTER BY table compared to a table with no optimization strategy at all?&lt;BR /&gt;&lt;BR /&gt;Thank you!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jun 2025 12:23:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121868#M46577</guid>
      <dc:creator>OODataEng</dc:creator>
      <dc:date>2025-06-16T12:23:01Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid clustering performance issue</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121885#M46587</link>
      <description>&lt;P&gt;Hey @OODataEng&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;To create a new table in Databricks using the schema and data from an existing table, you can use the CREATE TABLE AS SELECT command. This command allows you to define a new table based on the results of a SELECT query executed on the existing table.&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Here's a breakdown of the command and its components:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Code&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;CREATE TABLE new_table_name&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;AS&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SELECT * FROM existing_table_name;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;CREATE TABLE new_table_name:&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jun 2025 15:25:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-performance-issue/m-p/121885#M46587</guid>
      <dc:creator>Yogesh_Verma_</dc:creator>
      <dc:date>2025-06-16T15:25:23Z</dc:date>
    </item>
  </channel>
</rss>