Databricks Community

oakhill · ‎11-08-2024

Hi!

I am struggling to understand how to properly manage my table to make queries effective. My table has columns date_time_utc, car_id, car_owner etc. date_time_utc, car_id and position is usually the ZORDER or Liquid Clustering-columns.

Selecting max(date_time_utc) from this table to find the latest row can take a very long time.

We (almost) always query by date_time_utc and/or car_id.

I'm hoping for some input on partitioning strategies:

1. Does Liquid Clustering replace partitioning and zorder completely? It doesn't rewrite files as far as I understand.

2. For a 150B rows table, what is an optimal strategy?

szymon_dybczak · ‎11-08-2024

1. According to the databricks yes 🙂 But as always, I recommend to perform benchamarks yourself. There a lot of blog posts, that are saying that it's not alway the case. Yesterday, I was at data community event and presenter did several benchmark and he showed in some cases ZORDER performed better than Liquid Clustering. So, take marketing stuff with a pinch of salt 🙂

2. If I were on your place, I would try to benchmark it. Maybe try to use partitioning + ZORDER and compared it with Liquid Clustering.