topic Re: Photon and Predictive I/O vs. Liquid Clustering in Data Engineering

Photon and Predictive I/O vs. Liquid Clustering

korasino — Mon, 02 Jun 2025 10:11:07 GMT

Hi 🙂

Quick question about optimizing our Delta tables. Photon and Predictive I/O vs. Liquid Clustering (LC).
We have UUIDv4 columns (random, high-cardinality) used in both WHERE uuid = … filters and joins. From what I understand Photon (on Serverless warehouses) automatically does dynamic file pruning - building dynamic bloom style filters while querying and using table statistics for data skipping for point lookups (`WHERE uuid = ...`).
So:
1. LC vs Photon on a UUIDv4:
LC tightens min/max per file on UUIDv4, but Photon also does dynamic pruning already and skips blocks for WHERE uuid = … or joins (?). Is LC on UUIDv4 basically redundant since Photon handles the skipping? Does LC add any extra performance for point lookups or joins on UUIDv4?
2. Could LC on UUIDv4 hurt
UUIDv4 values are random, so LC would distribute those evenly - does this mean that it could actually hurt the rest of our optimization columns (like tstamps, grouping ids)
3. Joins on UUIDv4 with Photon:
When joining two large tables on a random UUID key, Photon will skip non-matching file blocks. Does LC’s min/max on UUIDv4 actually reduce shuffle or I/O for these joins, or does Photon already cover that? for join-heavy workloads on UUIDv4, is LC doing anything extra?
4. Where LC makes sense:
We have other columns that are high-cardinality but naturally ordered—like event timestamps (or maybe UUIDv7 in the future). LC on those should co-locate ranges and improve both filters and joins. Should we focus LC on timestamp or UUIDv7 instead, and just rely on Photon for UUIDv4?
Would love to hear any real-world experiences or best practices. Thanks!

Re: Photon and Predictive I/O vs. Liquid Clustering

SP_6721 — Wed, 04 Jun 2025 11:36:25 GMT

Hi @korasino

Liquid Clustering (LC) tightens file-level min/max stats on UUIDv4, but since Photon already handles dynamic pruning and data skipping using bloom-style filters and table stats, LC adds little to no benefit for point lookups (WHERE uuid = ...) or joins.
Because UUIDv4 values are random, LC distributes data evenly across files, which can actually hurt clustering on more useful columns like, timestamps reducing performance for time-based queries.
Photon also handles join filtering efficiently, so LC on UUIDv4 doesn’t help reduce shuffle or I/O further in join-heavy workloads.
Instead, LC is best used on naturally ordered columns like event timestamps or UUIDv7, where it can meaningfully improve query performance. For UUIDv4, relying on Photon alone is typically the better approach.

Re: Photon and Predictive I/O vs. Liquid Clustering

korasino — Tue, 10 Jun 2025 09:21:25 GMT

Hey, thanks for the reply. Could you share some documentation links around those bullet points in your answer? thanks!