Databricks Community

thackman · 2 weeks ago

Liquid clustering on integers or dates seems intuitive. But it's less clear how it would decide to partion files when the key column is a high cardinality string. Does it try the first character of the string and then if that's not unique enough it goes for 2 or three characters? Or does it order the key values and grab a median value, then a median from each of those ranges? Or does it generate a histogram of unique values?

We have a silver table with a 100M rows. The rows are wide, several hundred columns. The key/id for each row is a guid. That's how the system that feeds us data identifies the records. About once an hour we get a microbatch of 50 to 100 records. Bronze is fast because it is SCD type2 and just appends. Silver is SCD type 1 so we Merge. A mix of updates and inserts. 98% of the time, the records are from the past few hours. Edits for data from last quarter does happen but it's rare. Since the guids are random what we see it that the Merge operation has to append to 50-100 different files. Our table is partitioned into 200 files so that is a lot of access for a small operation.

The app team that feeds us data mentioned that they could change the guids to be sequential. With UUIDv7 format the date component is moved to the front of the bits to keep today's data closer together. Would liquid clustering work better with these guids or does it require the entire string to be high cardinality but non-unique like zip codes? Would the optimizer detect and work with these string ids better?

Isi · a week ago

Hey @thackman ,

I would share my opinion

Liquid Clustering doesn’t use prefix logic (e.g., first characters) nor a frequency histogram like ZOrdering did. Instead, it performs range-based segmentation based on lexicographic ordering of the column values. This works well when there’s some natural ordering correlation, but with totally random UUIDs (like v4), there is no temporal or logical proximity... which causes LC to spread updates across many files.

Switching to UUIDv7 (or another time-ordered GUID format) would significantly help. These IDs maintain temporal locality because the timestamp is embedded at the beginning of the string. Therefore, new rows inserted around the same time will fall into lexicographically similar clusters, and LC will group them much more effectively. You’d likely end up with MERGEs affecting only 1–2 files instead of 50–100.

So... yes Liquid Clustering would work far better with sequential or time-ordered GUIDs like UUIDv7. It doesn’t require low cardinality strings like zip codes, but it benefits enormously from values that are not random and have a logical grouping when sorted.

If you’re able to switch to UUIDv7, I highly recommend doing so you’ll likely see much faster merges, compactions, and query planning...

Let us know how it goes if you give it a try.

Hope this helps 🙂

Isi

emiliaswan · 17 hours ago

I found this great breakdown of current typography trends that touches on immersive typography and where it's heading. If you're interested in how 3D and AR are evolving in the design world, especially in terms of functionality and aesthetics,

Databricks Community

How does liquid clustering handle high cardinality strings?

Photos

Join Us as a Local Community Builder!

Announcing the APJ Databricks Smart Business Insights Challenge: Empowering Data-Driven Decision Mak

🚀 Monthly Databricks Get Started Days – Accelerate Your Learning Journey! 🚀

Business Intelligence in the Era of AI

Virtual Learning Festival: 9 April - 30 April

Data + AI Summit 2025 — registration now open!