Hi 🙂
Quick question about optimizing our Delta tables: Photon + Predictive I/O vs. Liquid Clustering (LC).
We have UUIDv4 columns (random, high-cardinality) used both in `WHERE uuid = …` filters and in joins. From what I understand, Photon (on serverless warehouses) automatically does dynamic file pruning, building Bloom-style filters at query time for joins and using table statistics for data skipping on point lookups (`WHERE uuid = ...`).
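For concreteness, our two access patterns look roughly like this (table and column names are illustrative placeholders, not our real schema):

```sql
-- Point lookup on the random UUIDv4 key
SELECT *
FROM events
WHERE uuid = '3f2b9c4e-8d1a-4f6b-9c2d-7e5a1b0c4d8f';

-- Join of two large tables on that same UUIDv4 key
SELECT e.uuid, e.event_ts, p.payload
FROM events e
JOIN payloads p
  ON e.uuid = p.uuid;
```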
So:
1. LC vs. Photon on UUIDv4:
LC tightens per-file min/max ranges on the UUIDv4 column, but Photon already does dynamic pruning and skips non-matching blocks for `WHERE uuid = …` filters and joins (right?). Is LC on UUIDv4 basically redundant since Photon handles the skipping, or does it add any extra performance for point lookups or joins?
2. Could LC on UUIDv4 hurt?
UUIDv4 values are random, so LC would spread them evenly across files and each file would still span nearly the full UUID range. Does spending a clustering dimension on UUIDv4 risk degrading the layout for our other clustering columns (like timestamps and grouping IDs)?
3. Joins on UUIDv4 with Photon:
When joining two large tables on a random UUID key, Photon should skip non-matching file blocks via dynamic file pruning. Does LC's min/max clustering on UUIDv4 actually reduce shuffle or I/O for these joins, or does Photon already cover that? For join-heavy workloads on UUIDv4, is LC doing anything extra?
4. Where LC makes sense:
We have other columns that are high-cardinality but naturally ordered, like event timestamps (or maybe UUIDv7 in the future). LC on those should co-locate ranges and improve both filters and joins. Should we focus LC on the timestamp (or UUIDv7) instead, and just rely on Photon for UUIDv4? A sketch of what we have in mind is below.
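Concretely, something like this (just a sketch; `events` and its columns are placeholders for our real tables):

```sql
-- Liquid-cluster on the naturally ordered column, not the random UUIDv4
CREATE TABLE events (
  uuid     STRING,
  event_ts TIMESTAMP,
  group_id BIGINT,
  payload  STRING
)
CLUSTER BY (event_ts);

-- Or retrofit an existing table; OPTIMIZE then re-clusters incrementally
ALTER TABLE events CLUSTER BY (event_ts);
OPTIMIZE events;
```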
Would love to hear any real-world experiences or best practices. Thanks!