Databricks Community

korasino · 06-02-2025

Hi 🙂

Quick question about optimizing our Delta tables: Photon and Predictive I/O vs. Liquid Clustering (LC).
We have UUIDv4 columns (random, high-cardinality) used in both `WHERE uuid = ...` filters and joins. From what I understand, Photon (on Serverless warehouses) automatically does dynamic file pruning: it builds dynamic Bloom-style filters while querying and uses table statistics for data skipping on point lookups (`WHERE uuid = ...`).
So:
1. LC vs Photon on a UUIDv4:
LC tightens min/max per file on UUIDv4, but Photon already does dynamic pruning and skips blocks for `WHERE uuid = ...` filters or joins (?). Is LC on UUIDv4 basically redundant since Photon handles the skipping? Does LC add any extra performance for point lookups or joins on UUIDv4?
2. Could LC on UUIDv4 hurt:
UUIDv4 values are random, so LC would distribute them evenly across files. Does this mean it could actually hurt our other optimization columns (like timestamps or grouping IDs)?
3. Joins on UUIDv4 with Photon:
When joining two large tables on a random UUID key, Photon will skip non-matching file blocks. Does LC’s min/max on UUIDv4 actually reduce shuffle or I/O for these joins, or does Photon already cover that? For join-heavy workloads on UUIDv4, is LC doing anything extra?
4. Where LC makes sense:
We have other columns that are high-cardinality but naturally ordered—like event timestamps (or maybe UUIDv7 in the future). LC on those should co-locate ranges and improve both filters and joins. Should we focus LC on timestamp or UUIDv7 instead, and just rely on Photon for UUIDv4?
Would love to hear any real-world experiences or best practices. Thanks!

SP_6721 · 06-04-2025

Hi @korasino

Liquid Clustering (LC) tightens file-level min/max stats on UUIDv4, but since Photon already handles dynamic pruning and data skipping using bloom-style filters and table stats, LC adds little to no benefit for point lookups (WHERE uuid = ...) or joins.
Because UUIDv4 values are random, LC distributes data evenly across files, which can actually hurt clustering on more useful columns like timestamps, reducing performance for time-based queries.
Photon also handles join filtering efficiently, so LC on UUIDv4 doesn’t help reduce shuffle or I/O further in join-heavy workloads.
Instead, LC is best used on naturally ordered columns like event timestamps or UUIDv7, where it can meaningfully improve query performance. For UUIDv4, relying on Photon alone is typically the better approach.
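To make the min/max discussion above concrete, here is a toy Python sketch of file-level data skipping. It is only an illustration under stated assumptions: "files" are plain fixed-size chunks of rows, "clustering" is a plain sort on one column (real Liquid Clustering works differently), and the helper names `split_into_files` and `files_to_scan` are made up for this example. It does not model Photon, dynamic file pruning, or Bloom-style filters at all; it only counts how many files a min/max check alone could rule out.

```python
import random
import uuid


def split_into_files(rows, rows_per_file):
    """Chunk rows into fixed-size 'files', preserving their current order."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files, column, value):
    """Count files whose [min, max] range for `column` could contain `value`."""
    count = 0
    for f in files:
        vals = [row[column] for row in f]
        if min(vals) <= value <= max(vals):
            count += 1
    return count


rng = random.Random(42)  # fixed seed so the toy data is reproducible

# Assumption: rows arrive in timestamp order, each carrying a random UUIDv4 key.
rows = [
    {"event_ts": ts, "id": str(uuid.UUID(int=rng.getrandbits(128), version=4))}
    for ts in range(10_000)
]
target = rows[5_000]

# 100 files in append (ingest) order vs. 100 files after sorting by the UUID.
ingest_order = split_into_files(rows, 100)
clustered_by_id = split_into_files(sorted(rows, key=lambda r: r["id"]), 100)

# Point lookup on the UUID: random keys give every file a wide min/max range.
scan_uuid_unclustered = files_to_scan(ingest_order, "id", target["id"])
# After sorting by UUID the per-file ranges are disjoint, so one file matches.
scan_uuid_clustered = files_to_scan(clustered_by_id, "id", target["id"])

# Point lookup on the timestamp: append order already keeps ranges disjoint.
scan_ts_unclustered = files_to_scan(ingest_order, "event_ts", target["event_ts"])
# Sorting by UUID scatters timestamps, widening the timestamp ranges again.
scan_ts_after_uuid_sort = files_to_scan(clustered_by_id, "event_ts", target["event_ts"])

print("uuid lookup, ingest order:  ", scan_uuid_unclustered, "of", len(ingest_order))
print("uuid lookup, sorted by uuid:", scan_uuid_clustered, "of", len(clustered_by_id))
print("ts lookup, ingest order:    ", scan_ts_unclustered, "of", len(ingest_order))
print("ts lookup, sorted by uuid:  ", scan_ts_after_uuid_sort, "of", len(clustered_by_id))
```

In this toy model, the UUID lookup can only rule out files by min/max once the data is sorted by the UUID, and sorting by the UUID costs the timestamp column the tight ranges it had from append order. Whether that trade-off shows up on a real table, and how much Photon's runtime pruning changes the picture, is something to measure on your own workload.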

korasino · 06-10-2025

Hey, thanks for the reply. Could you share some documentation links for the points in your answer? Thanks!

