<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Photon not used for the filter step (falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150595#M53479</link>
    <description>&lt;P&gt;We have a custom python notebook used to handle data loading. In this case, it's for a full overwrite of specific partitions. The notebook determines columns to use for the update based on incoming data. It creates a replace condition like this:&amp;nbsp;&lt;SPAN&gt;replaceCondition : (sales_date sales_date,integration_store_key integration_store_key) IN ((to_date('2026-03-10'),'ABOUTYOU_115CH_TH_WHOL_4054986000000'),(to_date('2026-03-10'),'DE_PCNORD_CK_WHOL_4043654007816'), (...).&amp;nbsp; &amp;nbsp;By execution time, it looks like this:&amp;nbsp; inset(named_struct('sales_date', sales_date, 'integration_store_key', integration_store_key))&amp;nbsp; &amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;During the execution, Photon is not engaged for the filter but rather falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM. Even though the target table is liquid clustered on the two columns, no data skipping happens and a full scan is executed at a huge cost (over a billion rows read to update a few thousand).&amp;nbsp; &amp;nbsp;&lt;STRONG&gt;Zero row group pruning: Rows skipped via stats filtering = 0, Row groups skipped via stats filtering = 0, Data filters - row groups filtered = 0&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Queery Profile shows:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Graph 0, photonExplain:&lt;/STRONG&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;reason: UNIMPLEMENTED_OPERATOR&lt;SPAN&gt;params: OPERATOR = OverwriteByExpressionExecV1 [...]&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;STRONG&gt;Graph 1, the filter node:&lt;/STRONG&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;tag: FILTER_EXEC          ← JVM operator, not PHOTON_FILTER_EXEC&lt;SPAN&gt;insightIds: ['99716c5c...']&lt;SPAN&gt;condition: inset(named_struct('sales_date', ..., 'integration_store_key', ...))&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;And immediately above it:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;tag: COLUMNAR_TO_ROW_EXEC ← data leaving Photon back to JVM&lt;SPAN&gt;tag: PHOTON_PROJECT_EXEC  ← Photon stops here&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;databricks documentation states:&amp;nbsp;"getStatsColumnOpt filters out &lt;STRONG&gt;non-leaf StructType columns as they lack statistics and &lt;STRONG&gt;skipping predicates can't use them.".&amp;nbsp; In my&amp;nbsp;case, sales_date and integration_store_key are &lt;STRONG&gt;not themselves struct columns — they're scalar columns being &lt;STRONG&gt;wrapped into a struct by the named_struct(...) expression that inset() generates.&amp;nbsp; It seems that because of the columns being wrapped up&amp;nbsp; in a struct, statistics are not available and it defaults to JVM. Is it correct to say that: " skipping works on &lt;STRONG&gt;per-column min/max stats, and the inset(named_struct(...)) predicate is not decomposable into per-column comparisons the skipping engine can evaluate." ? Can anyone confirm the reason Photon doesn't work in my situation?&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Wed, 11 Mar 2026 13:58:32 GMT</pubDate>
    <dc:creator>ItalSess_5094</dc:creator>
    <dc:date>2026-03-11T13:58:32Z</dc:date>
    <item>
      <title>Photon not used for the filter step (falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM)</title>
      <link>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150595#M53479</link>
      <description>&lt;P&gt;We have a custom python notebook used to handle data loading. In this case, it's for a full overwrite of specific partitions. The notebook determines columns to use for the update based on incoming data. It creates a replace condition like this:&amp;nbsp;&lt;SPAN&gt;replaceCondition : (sales_date sales_date,integration_store_key integration_store_key) IN ((to_date('2026-03-10'),'ABOUTYOU_115CH_TH_WHOL_4054986000000'),(to_date('2026-03-10'),'DE_PCNORD_CK_WHOL_4043654007816'), (...).&amp;nbsp; &amp;nbsp;By execution time, it looks like this:&amp;nbsp; inset(named_struct('sales_date', sales_date, 'integration_store_key', integration_store_key))&amp;nbsp; &amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;During the execution, Photon is not engaged for the filter but rather falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM. Even though the target table is liquid clustered on the two columns, no data skipping happens and a full scan is executed at a huge cost (over a billion rows read to update a few thousand).&amp;nbsp; &amp;nbsp;&lt;STRONG&gt;Zero row group pruning: Rows skipped via stats filtering = 0, Row groups skipped via stats filtering = 0, Data filters - row groups filtered = 0&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Queery Profile shows:&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Graph 0, photonExplain:&lt;/STRONG&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;reason: UNIMPLEMENTED_OPERATOR&lt;SPAN&gt;params: OPERATOR = OverwriteByExpressionExecV1 [...]&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;STRONG&gt;Graph 1, the filter node:&lt;/STRONG&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;tag: FILTER_EXEC          ← JVM operator, not PHOTON_FILTER_EXEC&lt;SPAN&gt;insightIds: ['99716c5c...']&lt;SPAN&gt;condition: inset(named_struct('sales_date', ..., 'integration_store_key', ...))&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;And immediately above it:&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;tag: COLUMNAR_TO_ROW_EXEC ← data leaving Photon back to JVM&lt;SPAN&gt;tag: PHOTON_PROJECT_EXEC  ← Photon stops here&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;databricks documentation states:&amp;nbsp;"getStatsColumnOpt filters out &lt;STRONG&gt;non-leaf StructType columns as they lack statistics and &lt;STRONG&gt;skipping predicates can't use them.".&amp;nbsp; In my&amp;nbsp;case, sales_date and integration_store_key are &lt;STRONG&gt;not themselves struct columns — they're scalar columns being &lt;STRONG&gt;wrapped into a struct by the named_struct(...) expression that inset() generates.&amp;nbsp; It seems that because of the columns being wrapped up&amp;nbsp; in a struct, statistics are not available and it defaults to JVM. Is it correct to say that: " skipping works on &lt;STRONG&gt;per-column min/max stats, and the inset(named_struct(...)) predicate is not decomposable into per-column comparisons the skipping engine can evaluate." ? Can anyone confirm the reason Photon doesn't work in my situation?&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 11 Mar 2026 13:58:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150595#M53479</guid>
      <dc:creator>ItalSess_5094</dc:creator>
      <dc:date>2026-03-11T13:58:32Z</dc:date>
    </item>
    <item>
      <title>Re: Photon not used for the filter step (falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM)</title>
      <link>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150669#M53494</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/39741"&gt;@ItalSess_5094&lt;/a&gt;&amp;nbsp;, I did some digging and would like to share some helpful hints.&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;What you are seeing is mostly explained by two things.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;1. Unsupported operator in Photon&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;The query profile shows:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;reason: UNIMPLEMENTED_OPERATOR&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;OPERATOR = OverwriteByExpressionExecV1&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p3"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;That is the big clue.&lt;/P&gt;
&lt;P class="p1"&gt;What it means is that the overwrite-by-expression portion of the plan — the part implementing your &lt;SPAN class="s2"&gt;replaceWhere&lt;/SPAN&gt; / &lt;SPAN class="s2"&gt;replaceCondition&lt;/SPAN&gt; logic — is not supported by Photon for this particular query shape. Once that happens, the plan has to fall back.&lt;/P&gt;
&lt;P class="p1"&gt;So the execution path ends up looking something like this:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;PHOTON_PROJECT_EXEC&lt;SPAN class="s1"&gt; runs&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;then &lt;SPAN class="s1"&gt;COLUMNAR_TO_ROW_EXEC&lt;/SPAN&gt; converts columnar batches back into rows&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;then &lt;SPAN class="s1"&gt;FILTER_EXEC&lt;/SPAN&gt; runs on the JVM side&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;So Photon not handling the filter here is expected once that operator shows up. This is not just a stats issue. The fallback is already baked into the plan because of the unsupported overwrite operator.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;2. Why stats and data skipping are not helping&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;This part comes down to how Delta data skipping works.&lt;/P&gt;
&lt;P class="p1"&gt;Delta skipping is based on per-column min/max statistics on leaf columns. It works best when the predicate is expressed directly against those base columns.&lt;/P&gt;
&lt;P class="p1"&gt;Your predicate is effectively shaped like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;WHERE named_struct('sales_date', sales_date,
                   'integration_store_key', integration_store_key)
      IN (struct('2026-03-10', 'ABOUTYOU_...'),
          struct('2026-03-10', 'DE_PCNORD_...'),
          ...)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;That is not a simple predicate on individual columns. It is a struct-valued comparison against a list of structs.&lt;/P&gt;
&lt;P class="p1"&gt;And that distinction matters.&lt;/P&gt;
&lt;P class="p1"&gt;Even though &lt;SPAN class="s1"&gt;sales_date&lt;/SPAN&gt; and &lt;SPAN class="s1"&gt;integration_store_key&lt;/SPAN&gt; are scalar columns in the table, the optimizer is not really seeing them as standalone filter predicates anymore. By the time it evaluates the condition, they are wrapped inside a &lt;SPAN class="s1"&gt;named_struct(...)&lt;/SPAN&gt; expression.&lt;/P&gt;
&lt;P class="p1"&gt;Today, that expression is not being decomposed back into independent column-level comparisons. So the engine cannot cleanly turn it into something like:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;sales_date = ...
AND integration_store_key IN (...)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;Because of that:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;data skipping cannot map the predicate cleanly to the min/max stats on &lt;SPAN class="s1"&gt;sales_date&lt;/SPAN&gt; and &lt;SPAN class="s1"&gt;integration_store_key&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Photon also does not have an implementation for this overwrite/filter pattern when expressed as a struct-based condition&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;So yes — your intuition is basically correct.&lt;/P&gt;
&lt;P class="p1"&gt;What is happening here is:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;skipping relies on per-column stats&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;named_struct(...) IN (...)&lt;/SPAN&gt; is not being broken back down into per-column predicates&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Photon does not support &lt;SPAN class="s1"&gt;OverwriteByExpressionExecV1&lt;/SPAN&gt; for this query shape&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;so the engine falls back to JVM row-based operators (&lt;SPAN class="s1"&gt;COLUMNAR_TO_ROW_EXEC&lt;/SPAN&gt; followed by &lt;SPAN class="s1"&gt;FILTER_EXEC&lt;/SPAN&gt;)&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;and as a result, you end up with zero row-group pruning&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Practical workaround&lt;/P&gt;
&lt;P class="p1"&gt;If you want Photon and data skipping to have a real chance to engage, the best move is to rewrite the overwrite condition using direct column predicates rather than wrapping the filter keys in a struct.&lt;/P&gt;
&lt;P class="p1"&gt;Conceptually, you want something shaped more like this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;WHERE sales_date = DATE '2026-03-10'
  AND integration_store_key IN (
        'ABOUTYOU_115CH_TH_WHOL_4054986000000',
        'DE_PCNORD_CK_WHOL_4043654007816',
        ...
      )&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;Or the equivalent in the DataFrame API / &lt;SPAN class="s1"&gt;replaceWhere&lt;/SPAN&gt; expression.&lt;/P&gt;
&lt;P class="p1"&gt;The key idea is simple: keep the predicate expressed directly on the underlying columns, and avoid constructing a struct inside the filter itself.&lt;/P&gt;
&lt;P class="p1"&gt;That gives the optimizer a much cleaner path and improves the odds that both data skipping and Photon can do something useful.&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p1"&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2026 10:40:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150669#M53494</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-03-12T10:40:17Z</dc:date>
    </item>
    <item>
      <title>Re: Photon not used for the filter step (falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM)</title>
      <link>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150786#M53515</link>
      <description>&lt;P&gt;Thank Louis. That confirms what I thought. The difficulty lies in creating the complex replacecondition we need without ending up with a struct. But we will figure it out.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Mar 2026 11:46:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150786#M53515</guid>
      <dc:creator>ItalSess_5094</dc:creator>
      <dc:date>2026-03-13T11:46:20Z</dc:date>
    </item>
    <item>
      <title>Re: Photon not used for the filter step (falls back to COLUMNAR_TO_ROW → FILTER_EXEC in JVM)</title>
      <link>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150799#M53520</link>
      <description>&lt;P&gt;Thanks for the feedback&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/39741"&gt;@ItalSess_5094&lt;/a&gt;&amp;nbsp;, do me a favor and click on "Accept as Solution" if you are satisified with my response. It will be helpful to others here in the community.&amp;nbsp; Thanks, Louis.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Mar 2026 12:34:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/photon-not-used-for-the-filter-step-falls-back-to-columnar-to/m-p/150799#M53520</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-03-13T12:34:05Z</dc:date>
    </item>
  </channel>
</rss>

