- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
4 weeks ago
We have a custom python notebook used to handle data loading. In this case, it's for a full overwrite of specific partitions. The notebook determines columns to use for the update based on incoming data. It creates a replace condition like this: replaceCondition : (sales_date sales_date,integration_store_key integration_store_key) IN ((to_date('2026-03-10'),'ABOUTYOU_115CH_TH_WHOL_4054986000000'),(to_date('2026-03-10'),'DE_PCNORD_CK_WHOL_4043654007816'), (...). By execution time, it looks like this: inset(named_struct('sales_date', sales_date, 'integration_store_key', integration_store_key))
During the execution, Photon is not engaged for the filter but rather falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM. Even though the target table is liquid clustered on the two columns, no data skipping happens and a full scan is executed at a huge cost (over a billion rows read to update a few thousand). Zero row group pruning: Rows skipped via stats filtering = 0, Row groups skipped via stats filtering = 0, Data filters - row groups filtered = 0
Queery Profile shows:
Graph 0, photonExplain:
reason: UNIMPLEMENTED_OPERATORparams: OPERATOR = OverwriteByExpressionExecV1 [...]Graph 1, the filter node:
tag: FILTER_EXEC ← JVM operator, not PHOTON_FILTER_EXECinsightIds: ['99716c5c...']condition: inset(named_struct('sales_date', ..., 'integration_store_key', ...))And immediately above it:
tag: COLUMNAR_TO_ROW_EXEC ← data leaving Photon back to JVMtag: PHOTON_PROJECT_EXEC ← Photon stops heredatabricks documentation states: "getStatsColumnOpt filters out non-leaf StructType columns as they lack statistics and skipping predicates can't use them.". In my case, sales_date and integration_store_key are not themselves struct columns — they're scalar columns being wrapped into a struct by the named_struct(...) expression that inset() generates. It seems that because of the columns being wrapped up in a struct, statistics are not available and it defaults to JVM. Is it correct to say that: " skipping works on per-column min/max stats, and the inset(named_struct(...)) predicate is not decomposable into per-column comparisons the skipping engine can evaluate." ? Can anyone confirm the reason Photon doesn't work in my situation?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
3 weeks ago
Greetings @ItalSess_5094 , I did some digging and would like to share some helpful hints.
What you are seeing is mostly explained by two things.
1. Unsupported operator in Photon
The query profile shows:
-
reason: UNIMPLEMENTED_OPERATOR
-
OPERATOR = OverwriteByExpressionExecV1
That is the big clue.
What it means is that the overwrite-by-expression portion of the plan — the part implementing your replaceWhere / replaceCondition logic — is not supported by Photon for this particular query shape. Once that happens, the plan has to fall back.
So the execution path ends up looking something like this:
-
PHOTON_PROJECT_EXEC runs
-
then COLUMNAR_TO_ROW_EXEC converts columnar batches back into rows
-
then FILTER_EXEC runs on the JVM side
So Photon not handling the filter here is expected once that operator shows up. This is not just a stats issue. The fallback is already baked into the plan because of the unsupported overwrite operator.
2. Why stats and data skipping are not helping
This part comes down to how Delta data skipping works.
Delta skipping is based on per-column min/max statistics on leaf columns. It works best when the predicate is expressed directly against those base columns.
Your predicate is effectively shaped like this:
WHERE named_struct('sales_date', sales_date,
'integration_store_key', integration_store_key)
IN (struct('2026-03-10', 'ABOUTYOU_...'),
struct('2026-03-10', 'DE_PCNORD_...'),
...)
That is not a simple predicate on individual columns. It is a struct-valued comparison against a list of structs.
And that distinction matters.
Even though sales_date and integration_store_key are scalar columns in the table, the optimizer is not really seeing them as standalone filter predicates anymore. By the time it evaluates the condition, they are wrapped inside a named_struct(...) expression.
Today, that expression is not being decomposed back into independent column-level comparisons. So the engine cannot cleanly turn it into something like:
sales_date = ...
AND integration_store_key IN (...)
Because of that:
-
data skipping cannot map the predicate cleanly to the min/max stats on sales_date and integration_store_key
-
Photon also does not have an implementation for this overwrite/filter pattern when expressed as a struct-based condition
So yes — your intuition is basically correct.
What is happening here is:
-
skipping relies on per-column stats
-
named_struct(...) IN (...) is not being broken back down into per-column predicates
-
Photon does not support OverwriteByExpressionExecV1 for this query shape
-
so the engine falls back to JVM row-based operators (COLUMNAR_TO_ROW_EXEC followed by FILTER_EXEC)
-
and as a result, you end up with zero row-group pruning
Practical workaround
If you want Photon and data skipping to have a real chance to engage, the best move is to rewrite the overwrite condition using direct column predicates rather than wrapping the filter keys in a struct.
Conceptually, you want something shaped more like this:
WHERE sales_date = DATE '2026-03-10'
AND integration_store_key IN (
'ABOUTYOU_115CH_TH_WHOL_4054986000000',
'DE_PCNORD_CK_WHOL_4043654007816',
...
)
Or the equivalent in the DataFrame API / replaceWhere expression.
The key idea is simple: keep the predicate expressed directly on the underlying columns, and avoid constructing a struct inside the filter itself.
That gives the optimizer a much cleaner path and improves the odds that both data skipping and Photon can do something useful.
Hope this helps, Louis.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
3 weeks ago
Thank Louis. That confirms what I thought. The difficulty lies in creating the complex replacecondition we need without ending up with a struct. But we will figure it out.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
3 weeks ago
Thanks for the feedback @ItalSess_5094 , do me a favor and click on "Accept as Solution" if you are satisified with my response. It will be helpful to others here in the community. Thanks, Louis.