Re: How does liquid clustering handle high cardina...

Isi · ‎04-13-2025

Hey @thackman ,

Here's my take:

Liquid Clustering doesn’t use prefix logic (e.g., the first few characters) or a frequency histogram the way Z-ordering did. Instead, it performs range-based segmentation based on the lexicographic ordering of the column values. This works well when the values have some natural ordering correlation, but totally random UUIDs (like v4) have no temporal or logical proximity, so Liquid Clustering ends up spreading updates across many files.
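To make the "spread across many files" point concrete, here is a rough simulation in plain Python. It is only a sketch of the range-segmentation idea as I described it, not how Databricks actually lays out files: `files_touched`, the 50-file split, and the seeded `random_uuid` stand-in for `uuid.uuid4()` are all my own assumptions for illustration.

```python
import bisect
import random
import uuid


def random_uuid(rng):
    # Seedable stand-in for uuid.uuid4() so the run is reproducible.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def files_touched(existing_keys, new_keys, num_files):
    """Count how many range 'files' a batch of new keys lands in.

    Hypothetical model: existing keys are sorted lexicographically and
    cut into num_files contiguous ranges of roughly equal size.
    """
    ordered = sorted(existing_keys)
    size = len(ordered) // num_files
    # Upper boundary of every file except the last, which is open-ended.
    boundaries = [ordered[(i + 1) * size - 1] for i in range(num_files - 1)]
    return len({bisect.bisect_left(boundaries, key) for key in new_keys})


rng = random.Random(42)
existing = [random_uuid(rng) for _ in range(10_000)]
batch = [random_uuid(rng) for _ in range(200)]

touched = files_touched(existing, batch, num_files=50)
print(f"A batch of {len(batch)} random keys touches {touched} of 50 files")
```

In this toy model a small batch of random keys lands in almost every range, which is the same pattern as a MERGE rewriting dozens of files.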

Switching to UUIDv7 (or another time-ordered GUID format) would significantly help. These IDs maintain temporal locality because the timestamp is embedded at the beginning of the string. Therefore, new rows inserted around the same time will fall into lexicographically similar clusters, and LC will group them much more effectively. You’d likely end up with MERGEs affecting only 1–2 files instead of 50–100.
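If it helps, here is a minimal sketch of what "timestamp at the beginning" means in practice. `uuid7_sketch` is a hypothetical helper I wrote by hand following the UUIDv7 bit layout (48-bit Unix millisecond timestamp, then version, variant, and random bits); it is not a library function, and the base timestamp is an arbitrary example value.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value (illustrative, not a production generator)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                # top 12 random bits
    rand_b = rand & ((1 << 62) - 1)    # low 62 random bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # timestamp in the leading 48 bits
    value |= 0x7 << 76                         # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                        # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


# Generate IDs one millisecond apart and compare string order to time order.
base_ms = 1_744_502_400_000  # arbitrary example timestamp (assumption)
ids = [str(uuid7_sketch(base_ms + i)) for i in range(1000)]
in_time_order = ids == sorted(ids)
print(in_time_order)
```

Because the timestamp fills the first 12 hex digits, sorting the strings gives the same order as sorting by creation time, so rows written together end up next to each other. Note that this sketch does not guarantee ordering between IDs created within the same millisecond; a real generator may add a counter for that.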

So yes, Liquid Clustering would work far better with sequential or time-ordered GUIDs like UUIDv7. It doesn’t require low-cardinality strings like zip codes, but it benefits enormously from values that are not random and that group logically when sorted.

If you’re able to switch to UUIDv7, I highly recommend doing so. You’ll likely see much faster merges, compactions, and query planning.

Let us know how it goes if you give it a try.

Hope this helps 🙂

Isi
