<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: From Partitioning to Liquid Clustering in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/113982#M44691</link>
<description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98256"&gt;@Volker&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P class=""&gt;First of all, I’d recommend considering &lt;SPAN class=""&gt;&lt;STRONG&gt;Auto Liquid Clustering&lt;/STRONG&gt;&lt;/SPAN&gt;, as it can simplify the process of defining clustering keys.&lt;/P&gt;&lt;P class=""&gt;You can read more about it in the &lt;A href="https://docs.databricks.com/aws/en/delta/clustering#automatic-liquid-clustering" target="_blank" rel="noopener"&gt;Databricks documentation&lt;/A&gt;&amp;nbsp;(it’s currently in &lt;SPAN class=""&gt;&lt;STRONG&gt;Public Preview&lt;/STRONG&gt;&lt;/SPAN&gt;, but you can probably start using it already).&lt;/P&gt;&lt;P class=""&gt;Since the official docs are still limited, here’s a quick summary of the criteria the backend uses to trigger Liquid Clustering:&lt;/P&gt;&lt;P class=""&gt;• The table size must be &lt;SPAN class=""&gt;&lt;STRONG&gt;at least 256 MB&lt;/STRONG&gt;&lt;/SPAN&gt;.&lt;/P&gt;&lt;P class=""&gt;• There must be &lt;SPAN class=""&gt;&lt;STRONG&gt;at least 10 scans&lt;/STRONG&gt;&lt;/SPAN&gt; with pruning predicates.&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;• The clustering key &lt;/SPAN&gt;&lt;STRONG&gt;must not have been changed in the last 2 weeks&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;• It usually takes &lt;SPAN class=""&gt;&lt;STRONG&gt;2 to 5 hours&lt;/STRONG&gt;&lt;/SPAN&gt; for the table to reflect the Liquid Clustering key after the conditions are met.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Answering your questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;•&amp;nbsp;&lt;SPAN class=""&gt;&lt;STRONG&gt;Yes&lt;/STRONG&gt;&lt;/SPAN&gt;, the removed files will be deleted by &lt;SPAN class=""&gt;VACUUM&lt;/SPAN&gt; once they exceed the default 7-day retention period.&lt;/P&gt;&lt;P class=""&gt;•&amp;nbsp;&lt;SPAN 
class=""&gt;&lt;STRONG&gt;Yes&lt;/STRONG&gt;&lt;/SPAN&gt;, Liquid Clustering can handle full timestamps like &lt;SPAN class=""&gt;processing_dttm&lt;/SPAN&gt;.&lt;/P&gt;&lt;P class=""&gt;However, &lt;SPAN class=""&gt;&lt;STRONG&gt;using a timestamp with minutes and seconds can lead to too many small clusters&lt;/STRONG&gt;&lt;/SPAN&gt; if the values are highly distinct and that level of granularity isn’t relevant for filtering. In such cases, it may reduce clustering efficiency rather than improve it.&lt;/P&gt;&lt;P class=""&gt;If your queries don’t require high precision, I’d recommend filtering on &lt;SPAN class=""&gt;truncated versions of your timestamp&lt;/SPAN&gt;.&lt;BR /&gt;&lt;BR /&gt;Hope this helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sun, 30 Mar 2025 00:07:29 GMT</pubDate>
    <dc:creator>Isi</dc:creator>
    <dc:date>2025-03-30T00:07:29Z</dc:date>
    <item>
      <title>From Partitioning to Liquid Clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/113188#M44456</link>
      <description>&lt;P&gt;We had some Delta tables that were previously partitioned on year, month, day, and hour. This resulted in quite small partitions, so we switched to Liquid Clustering.&lt;/P&gt;&lt;P&gt;We followed these steps:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Remove partitioning by doing REPLACE&lt;/LI&gt;&lt;LI&gt;ALTER TABLE --- CLUSTER BY&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;Run OPTIMIZE --- FULL&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I see in the query output that some files have been written and some have been removed, but in the underlying S3 bucket I still see the Parquet files in the old Hive-style partition layout.&lt;/P&gt;&lt;P&gt;Are these old files that will be removed by some VACUUM job? And what does OPTIMIZE do if the data in the root bucket is still stored in the Hive-style partition layout even though we removed the partitioning from the Delta table?&lt;/P&gt;&lt;P&gt;Also, we are now using processing_dttm as the cluster key instead of year, month, day, hour. The processing_dttm column contains the dttm like so:&amp;nbsp;&lt;SPAN&gt;2024-11-19T09:30:00.765+00:00.&amp;nbsp;&lt;BR /&gt;Would it be better to only include year, month, and day, or maybe hour, instead of minutes or seconds? Or is Liquid Clustering smart enough to infer this from the dttm?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 20 Mar 2025 16:16:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/113188#M44456</guid>
      <dc:creator>Volker</dc:creator>
      <dc:date>2025-03-20T16:16:31Z</dc:date>
    </item>
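The three migration steps described in the post above can be sketched in Databricks SQL roughly as follows. This is a minimal sketch, not the poster's exact commands: the table name `events` and the clustering column `processing_dttm` are illustrative assumptions standing in for the elided `---` placeholders.

```sql
-- 1) Remove the Hive-style partitioning by redefining the table without
--    PARTITIONED BY (table name `events` is an assumption).
CREATE OR REPLACE TABLE events AS SELECT * FROM events;

-- 2) Define the Liquid Clustering key.
ALTER TABLE events CLUSTER BY (processing_dttm);

-- 3) Rewrite all existing files according to the new clustering key.
OPTIMIZE events FULL;
```

Note that `OPTIMIZE … FULL` rewrites the table's files into the new clustered layout, but the superseded files stay on S3 (still in the old directory layout) until a later `VACUUM` physically deletes them.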
    <item>
      <title>Re: From Partitioning to Liquid Clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/113982#M44691</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98256"&gt;@Volker&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P class=""&gt;First of all, I’d recommend considering &lt;SPAN class=""&gt;&lt;STRONG&gt;Auto Liquid Clustering&lt;/STRONG&gt;&lt;/SPAN&gt;, as it can simplify the process of defining clustering keys.&lt;/P&gt;&lt;P class=""&gt;You can read more about it in the &lt;A href="https://docs.databricks.com/aws/en/delta/clustering#automatic-liquid-clustering" target="_blank" rel="noopener"&gt;Databricks documentation&lt;/A&gt;&amp;nbsp;(it’s currently in &lt;SPAN class=""&gt;&lt;STRONG&gt;Public Preview&lt;/STRONG&gt;&lt;/SPAN&gt;, but you can probably start using it already)&lt;/P&gt;&lt;P class=""&gt;Since the official docs are still limited, here’s a quick summary of the criteria used by the backend to trigger Liquid Clustering:&lt;/P&gt;&lt;P class=""&gt;• The table size must be &lt;SPAN class=""&gt;&lt;STRONG&gt;at least 256 MB&lt;/STRONG&gt;&lt;/SPAN&gt;.&lt;/P&gt;&lt;P class=""&gt;• There must be &lt;SPAN class=""&gt;&lt;STRONG&gt;at least 10 pruning-eligible scans&lt;/STRONG&gt;&lt;/SPAN&gt; with pruning predicates.&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN class=""&gt;• The clustering key &lt;/SPAN&gt;&lt;STRONG&gt;must not have been changed in the last 2 weeks&lt;/STRONG&gt;&lt;SPAN class=""&gt;.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;• It usually takes &lt;SPAN class=""&gt;&lt;STRONG&gt;2 to 5 hours&lt;/STRONG&gt;&lt;/SPAN&gt; for the table to reflect the Liquid Clustering key after the conditions are met.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Answering your questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;•&amp;nbsp;&lt;SPAN class=""&gt;&lt;STRONG&gt;Yes&lt;/STRONG&gt;&lt;/SPAN&gt;, deleted files should be removed with &lt;SPAN class=""&gt;VACUUM&lt;/SPAN&gt; after 7 days — this is the default behavior.&lt;/P&gt;&lt;P class=""&gt;•&amp;nbsp;&lt;SPAN 
class=""&gt;&lt;STRONG&gt;Yes&lt;/STRONG&gt;&lt;/SPAN&gt;, Liquid Clustering can handle full timestamps like &lt;SPAN class=""&gt;processing_dttm&lt;/SPAN&gt;.&lt;/P&gt;&lt;P class=""&gt;However, &lt;SPAN class=""&gt;&lt;STRONG&gt;using a timestamp with minutes and seconds can lead to too many small clusters&lt;/STRONG&gt;&lt;/SPAN&gt; if the values are highly distinct and that level of granularity isn’t relevant for filtering. In such cases, this may reduce clustering efficiency rather than improve it.&lt;/P&gt;&lt;P class=""&gt;Maybe if your queries don’t require high precision, I recommend using &lt;SPAN class=""&gt;truncated versions of your timestamp&lt;/SPAN&gt; when filtering&lt;BR /&gt;&lt;BR /&gt;Hopee this helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 30 Mar 2025 00:07:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/113982#M44691</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-03-30T00:07:29Z</dc:date>
    </item>
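The cleanup and truncated-timestamp advice in the reply above might look like this in Databricks SQL (again, the table name `events` is an assumption; `date_trunc` granularity should match how your queries actually filter):

```sql
-- Files removed by OPTIMIZE become eligible for physical deletion once they
-- exceed the default 7-day retention period.
VACUUM events;

-- When minute/second precision isn't needed, filter on a truncated version
-- of the timestamp so the predicate works at a coarser granularity.
SELECT *
FROM events
WHERE date_trunc('DAY', processing_dttm) = TIMESTAMP '2024-11-19 00:00:00';
```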
    <item>
      <title>Re: From Partitioning to Liquid Clustering</title>
      <link>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/155203#M54207</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/98256"&gt;@Volker&lt;/a&gt;&amp;nbsp;, we are in Private Preview now for a feature that helps you easily convert a table from Partitioning to Liquid Clustering. Here is the &lt;A href="https://docs.google.com/document/d/1Txuf72EzrF9PdfVPOYRcba3ca3aUt7TL3AbyMZvnv20/edit?usp=sharing" target="_self"&gt;User Guide&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2026 14:07:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/from-partitioning-to-liquid-clustering/m-p/155203#M54207</guid>
      <dc:creator>jeffrey-gong</dc:creator>
      <dc:date>2026-04-22T14:07:51Z</dc:date>
    </item>
  </channel>
</rss>

