How to optimize queries on a 150B table? ZORDER, LC or partioning?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-08-2024 12:26 AM - edited 11-08-2024 12:27 AM
Hi!
I am struggling to understand how to properly manage my table to make queries effective. My table has columns date_time_utc, car_id, car_owner etc. date_time_utc, car_id and position is usually the ZORDER or Liquid Clustering-columns.
Selecting max(date_time_utc) from this table to find the latest row can take a very long time.
We (almost) always query by date_time_utc and/or car_id.
I'm hoping for some input on partitioning strategies:
1. Does Liquid Clustering replace partitioning and zorder completely? It doesn't rewrite files as far as I understand.
2. For a 150B rows table, what is an optimal strategy?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-08-2024 04:10 AM
1. According to the databricks yes 🙂 But as always, I recommend to perform benchamarks yourself. There a lot of blog posts, that are saying that it's not alway the case. Yesterday, I was at data community event and presenter did several benchmark and he showed in some cases ZORDER performed better than Liquid Clustering. So, take marketing stuff with a pinch of salt 🙂
2. If I were on your place, I would try to benchmark it. Maybe try to use partitioning + ZORDER and compared it with Liquid Clustering.

