<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to optimize queries on a 150B table? ZORDER, LC or partioning? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-queries-on-a-150b-table-zorder-lc-or-partioning/m-p/98174#M39630</link>
    <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I am struggling to understand how to properly manage my table to make queries effective. My table has columns date_time_utc, car_id, car_owner etc. date_time_utc, car_id and position is usually the ZORDER or Liquid Clustering-columns.&lt;/P&gt;&lt;P&gt;Selecting max(date_time_utc) from this table to find the latest row can take a very long time.&lt;BR /&gt;&lt;BR /&gt;We (almost) always query by date_time_utc and/or car_id.&lt;/P&gt;&lt;P&gt;I'm hoping for some input on partitioning strategies:&lt;/P&gt;&lt;P&gt;1. Does Liquid Clustering replace partitioning and zorder completely? It doesn't rewrite files as far as I understand.&lt;/P&gt;&lt;P&gt;2. For a 150B rows table, what is an optimal strategy?&lt;/P&gt;</description>
    <pubDate>Fri, 08 Nov 2024 08:27:20 GMT</pubDate>
    <dc:creator>oakhill</dc:creator>
    <dc:date>2024-11-08T08:27:20Z</dc:date>
    <item>
      <title>How to optimize queries on a 150B table? ZORDER, LC or partioning?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-queries-on-a-150b-table-zorder-lc-or-partioning/m-p/98174#M39630</link>
      <description>&lt;P&gt;Hi!&lt;/P&gt;&lt;P&gt;I am struggling to understand how to properly manage my table to make queries effective. My table has columns date_time_utc, car_id, car_owner etc. date_time_utc, car_id and position is usually the ZORDER or Liquid Clustering-columns.&lt;/P&gt;&lt;P&gt;Selecting max(date_time_utc) from this table to find the latest row can take a very long time.&lt;BR /&gt;&lt;BR /&gt;We (almost) always query by date_time_utc and/or car_id.&lt;/P&gt;&lt;P&gt;I'm hoping for some input on partitioning strategies:&lt;/P&gt;&lt;P&gt;1. Does Liquid Clustering replace partitioning and zorder completely? It doesn't rewrite files as far as I understand.&lt;/P&gt;&lt;P&gt;2. For a 150B rows table, what is an optimal strategy?&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 08:27:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-queries-on-a-150b-table-zorder-lc-or-partioning/m-p/98174#M39630</guid>
      <dc:creator>oakhill</dc:creator>
      <dc:date>2024-11-08T08:27:20Z</dc:date>
    </item>
    <item>
      <title>Re: How to optimize queries on a 150B table? ZORDER, LC or partioning?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-queries-on-a-150b-table-zorder-lc-or-partioning/m-p/98182#M39635</link>
      <description>&lt;P&gt;1. According to the databricks yes &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; But as always, I recommend to perform benchamarks yourself. There a lot of blog posts, that are saying that it's not alway the case. Yesterday, I was at data community event and presenter did several benchmark and he showed in some cases ZORDER performed better than Liquid Clustering. So, take marketing stuff with a pinch of salt &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;2. If I were on your place, I would try to benchmark it. Maybe try to use partitioning + ZORDER and compared it&amp;nbsp; with Liquid Clustering.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 12:10:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-queries-on-a-150b-table-zorder-lc-or-partioning/m-p/98182#M39635</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-08T12:10:48Z</dc:date>
    </item>
  </channel>
</rss>

