Louis_Frolio
Databricks Employee
Databricks Employee

Greetings @ItalSess_5094 , I did some digging and would like to share some helpful hints. 

What you are seeing is mostly explained by two things.

1. Unsupported operator in Photon

The query profile shows:

  • reason: UNIMPLEMENTED_OPERATOR

  • OPERATOR = OverwriteByExpressionExecV1

 

That is the big clue.

What it means is that the overwrite-by-expression portion of the plan — the part implementing your replaceWhere / replaceCondition logic — is not supported by Photon for this particular query shape. Once that happens, the plan has to fall back.

So the execution path ends up looking something like this:

  • PHOTON_PROJECT_EXEC runs

  • then COLUMNAR_TO_ROW_EXEC converts columnar batches back into rows

  • then FILTER_EXEC runs on the JVM side

So Photon not handling the filter here is expected once that operator shows up. This is not just a stats issue. The fallback is already baked into the plan because of the unsupported overwrite operator.

2. Why stats and data skipping are not helping

This part comes down to how Delta data skipping works.

Delta skipping is based on per-column min/max statistics on leaf columns. It works best when the predicate is expressed directly against those base columns.

Your predicate is effectively shaped like this:

WHERE named_struct('sales_date', sales_date,
                   'integration_store_key', integration_store_key)
      IN (struct('2026-03-10', 'ABOUTYOU_...'),
          struct('2026-03-10', 'DE_PCNORD_...'),
          ...)

That is not a simple predicate on individual columns. It is a struct-valued comparison against a list of structs.

And that distinction matters.

Even though sales_date and integration_store_key are scalar columns in the table, the optimizer is not really seeing them as standalone filter predicates anymore. By the time it evaluates the condition, they are wrapped inside a named_struct(...) expression.

Today, that expression is not being decomposed back into independent column-level comparisons. So the engine cannot cleanly turn it into something like:

sales_date = ...
AND integration_store_key IN (...)

Because of that:

  • data skipping cannot map the predicate cleanly to the min/max stats on sales_date and integration_store_key

  • Photon also does not have an implementation for this overwrite/filter pattern when expressed as a struct-based condition

So yes — your intuition is basically correct.

What is happening here is:

  • skipping relies on per-column stats

  • named_struct(...) IN (...) is not being broken back down into per-column predicates

  • Photon does not support OverwriteByExpressionExecV1 for this query shape

  • so the engine falls back to JVM row-based operators (COLUMNAR_TO_ROW_EXEC followed by FILTER_EXEC)

  • and as a result, you end up with zero row-group pruning

Practical workaround

If you want Photon and data skipping to have a real chance to engage, the best move is to rewrite the overwrite condition using direct column predicates rather than wrapping the filter keys in a struct.

Conceptually, you want something shaped more like this:

WHERE sales_date = DATE '2026-03-10'
  AND integration_store_key IN (
        'ABOUTYOU_115CH_TH_WHOL_4054986000000',
        'DE_PCNORD_CK_WHOL_4043654007816',
        ...
      )

Or the equivalent in the DataFrame API / replaceWhere expression.

The key idea is simple: keep the predicate expressed directly on the underlying columns, and avoid constructing a struct inside the filter itself.

That gives the optimizer a much cleaner path and improves the odds that both data skipping and Photon can do something useful.

 

Hope this helps, Louis.

View solution in original post