Liquid clustering on integers or dates seems intuitive. But it's less clear how it decides to partition files when the key column is a high-cardinality string. Does it try the first character of the string and, if that isn't selective enough, move on to two or three characters? Does it sort the key values and split on the median, then on the median of each resulting range? Or does it build a histogram of the unique values?
We have a silver table with about 100M rows. The rows are wide, several hundred columns. The key/ID for each row is a GUID; that's how the system that feeds us data identifies its records. About once an hour we get a microbatch of 50 to 100 records. Bronze is fast because it is SCD Type 2 and just appends. Silver is SCD Type 1, so we MERGE a mix of updates and inserts. 98% of the time the records are from the past few hours; edits to data from last quarter do happen, but they're rare. Because the GUIDs are random, what we see is that the MERGE ends up touching 50-100 different files. Our table is spread across roughly 200 files, so that is a lot of file access for such a small operation.
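
For context, this is roughly the hourly MERGE we run, with placeholder table and column names (`silver_table`, `bronze_latest_batch`, `record_id`) rather than the real schema:

```python
# Minimal sketch of the hourly silver MERGE (SCD Type 1); all names are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ~50-100 records arrive per microbatch; source table name is a placeholder.
microbatch = spark.read.table("bronze_latest_batch")

silver = DeltaTable.forName(spark, "silver_table")

(silver.alias("t")
    .merge(microbatch.alias("s"), "t.record_id = s.record_id")
    .whenMatchedUpdateAll()       # SCD Type 1: overwrite matched rows in place
    .whenNotMatchedInsertAll()    # insert the handful of genuinely new GUIDs
    .execute())
```

Because the GUIDs scatter uniformly across the key space, the match condition lands on a different data file for almost every input row.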

The app team that feeds us data mentioned that they could switch to sequential GUIDs. In the UUIDv7 format the timestamp occupies the leading bits, so IDs generated around the same time sort close together. Would liquid clustering work better with these GUIDs, or does it need the key to be high-cardinality but non-unique, like ZIP codes? Would the optimizer detect the ordering in these string IDs and take advantage of it?
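
If it matters for the answer, this is the kind of change we'd be trying on our side, assuming a Delta/Databricks runtime recent enough to enable liquid clustering on an existing, unpartitioned table (names are placeholders):

```python
# Sketch: cluster the silver table on the GUID key instead of relying on
# fixed file layout. Converting an existing table this way assumes a recent
# runtime and that the table is not Hive-partitioned; otherwise a rewrite
# (e.g. CTAS) would be needed. Table/column names are placeholders.
spark.sql("ALTER TABLE silver_table CLUSTER BY (record_id)")

# Reclustering of newly written data happens incrementally on OPTIMIZE runs.
spark.sql("OPTIMIZE silver_table")
```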
