Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Photon and Predictive I/O vs. Liquid Clustering

korasino
New Contributor II

Hi 🙂

Quick question about optimizing our Delta tables: Photon and Predictive I/O vs. Liquid Clustering (LC).
We have UUIDv4 columns (random, high-cardinality) used in both WHERE uuid = … filters and joins. From what I understand, Photon (on Serverless warehouses) automatically does dynamic file pruning: it builds dynamic Bloom-style filters at query time and uses table statistics for data skipping on point lookups (`WHERE uuid = ...`).
So:
1. LC vs Photon on a UUIDv4:
LC tightens min/max per file on UUIDv4, but Photon also does dynamic pruning already and skips blocks for WHERE uuid = … or joins (?). Is LC on UUIDv4 basically redundant since Photon handles the skipping? Does LC add any extra performance for point lookups or joins on UUIDv4?
2. Could LC on UUIDv4 hurt?
UUIDv4 values are random, so LC would distribute them evenly. Does this mean it could actually hurt clustering on the rest of our optimization columns (like timestamps and grouping IDs)?
3. Joins on UUIDv4 with Photon:
When joining two large tables on a random UUID key, Photon will skip non-matching file blocks. Does LC’s min/max on UUIDv4 actually reduce shuffle or I/O for these joins, or does Photon already cover that? For join-heavy workloads on UUIDv4, is LC doing anything extra?
4. Where LC makes sense:
We have other columns that are high-cardinality but naturally ordered, like event timestamps (or maybe UUIDv7 in the future). LC on those should co-locate ranges and improve both filters and joins. Should we focus LC on timestamp or UUIDv7 instead, and just rely on Photon for UUIDv4?
Would love to hear any real-world experiences or best practices. Thanks!
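
To make the premise in question 1 concrete, here is a minimal Python sketch of file-level min/max data skipping for a point lookup. It is a toy model, not Databricks code: the "files" are plain Python lists, "clustering" is approximated by a simple sort, and the names `make_uuids`, `split_into_files`, and `candidate_files` are hypothetical helpers made up for this illustration.

```python
import random
import uuid


def make_uuids(n, seed=0):
    """Generate n reproducible UUIDv4-style strings (seeded, for illustration only)."""
    rng = random.Random(seed)
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n)]


def split_into_files(values, rows_per_file):
    """Chunk a list of values into fixed-size 'files'."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


def candidate_files(files, needle):
    """Count files whose [min, max] range could contain the needle."""
    return sum(1 for f in files if min(f) <= needle <= max(f))


keys = make_uuids(10_000)
needle = sorted(keys)[len(keys) // 2]  # look up a key from the middle of the range

# Assumption: without clustering, rows land in files in arrival (random) order.
unclustered = split_into_files(keys, 100)
# Assumption: clustering on the UUID is modeled as a plain sort before writing files.
clustered = split_into_files(sorted(keys), 100)

unclustered_hits = candidate_files(unclustered, needle)
clustered_hits = candidate_files(clustered, needle)
print("files to read, unclustered:", unclustered_hits)
print("files to read, clustered:  ", clustered_hits)
```

In this model the sorted layout leaves exactly one candidate file, while the random layout leaves nearly every file as a candidate, because each file's min/max range spans almost the whole UUID space. Whether Photon's runtime filters close that gap on a real table is the open question in this thread and is not something the sketch can answer.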

2 REPLIES

SP_6721
Contributor

Hi @korasino 

  • Liquid Clustering (LC) tightens file-level min/max stats on UUIDv4, but since Photon already handles dynamic pruning and data skipping using bloom-style filters and table stats, LC adds little to no benefit for point lookups (WHERE uuid = ...) or joins.
  • Because UUIDv4 values are random, LC distributes data evenly across files, which can actually hurt clustering on more useful columns like timestamps, reducing performance for time-based queries.
  • Photon also handles join filtering efficiently, so LC on UUIDv4 doesn’t help reduce shuffle or I/O further in join-heavy workloads.
  • Instead, LC is best used on naturally ordered columns like event timestamps or UUIDv7, where it can meaningfully improve query performance. For UUIDv4, relying on Photon alone is typically the better approach.
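
The second bullet (ordering by a random key scatters timestamps across files) can be sketched with a small Python model. This is an illustration under stated assumptions, not the actual Liquid Clustering algorithm: rows are dicts, the random `id` is a 128-bit integer standing in for a UUIDv4, clustering is approximated by a plain sort, and `split_into_files` and `files_touched` are hypothetical helpers.

```python
import random


def split_into_files(rows, rows_per_file):
    """Chunk a list of rows into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_touched(files, column, lo, hi):
    """Count files whose [min, max] range on `column` overlaps [lo, hi]."""
    count = 0
    for f in files:
        values = [row[column] for row in f]
        if min(values) <= hi and max(values) >= lo:
            count += 1
    return count


rng = random.Random(0)
# Rows arrive in timestamp order; each carries a random 128-bit id (UUIDv4 stand-in).
rows = [{"ts": t, "id": rng.getrandbits(128)} for t in range(10_000)]

# Layout 1: files written in arrival order, so timestamps are naturally co-located.
by_arrival = split_into_files(rows, 100)
# Layout 2: files written after sorting by the random id (simplified "cluster by id").
by_id = split_into_files(sorted(rows, key=lambda r: r["id"]), 100)

lo, hi = 5_000, 5_099  # a narrow time-range filter
touched_arrival = files_touched(by_arrival, "ts", lo, hi)
touched_by_id = files_touched(by_id, "ts", lo, hi)
print("time-range query, arrival order:", touched_arrival)
print("time-range query, sorted by id: ", touched_by_id)
```

In this model the arrival-order layout touches a single file for the time range, while the id-sorted layout touches nearly all of them, since every file then holds timestamps from across the whole table. A real table clustered on several columns would behave differently, so treat this only as intuition for why a random key is a poor clustering choice when time-based filters matter.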

korasino
New Contributor II

Hey, thanks for the reply. Could you share some documentation links for the bullet points in your answer? Thanks!
