<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>article Incremental REPLACE WHERE Flows Brings Targeted Refreshes to SDP and DBSQL in Technical Blog</title>
    <link>https://community.databricks.com/t5/technical-blog/incremental-replace-where-flows-brings-targeted-refreshes-to-sdp/ba-p/159057</link>
    <description>&lt;P&gt;&lt;SPAN&gt;Lakeflow Spark Declarative Pipelines (SDP) has long supported two ways to maintain a table: a &lt;/SPAN&gt;&lt;STRONG&gt;streaming flow&lt;/STRONG&gt;&lt;SPAN&gt; for incremental ingestion&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;and a&amp;nbsp;&lt;STRONG&gt;materialized view&lt;/STRONG&gt;&amp;nbsp;for a fully derived&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;result that is always exact. Both are great fits for the shapes they were designed for. What's been missing is a third pattern, the one many batch ETL teams already work in: &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;"on this scheduled run, refresh just the last 7 days; leave the rest of the table alone."&lt;/SPAN&gt;&lt;/I&gt;&lt;I&gt;&lt;/I&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;That's what incremental &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;flows bring to SDP and DBSQL — a third option, a &lt;/SPAN&gt;&lt;STRONG&gt;batch processing flow&lt;/STRONG&gt;&lt;SPAN&gt; with automatic incrementalization. You declare the slice each run owns with a single predicate, and the pipeline refreshes exactly that slice on every update. The same primitive also supports a one-time predicate override, so the occasional fix-a-column-and-backfill case uses the same mechanism, with no need to recompute the full table.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And it's more than the &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;you'd write in Databricks SQL. Incremental &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;flows are powered by &lt;/SPAN&gt;&lt;STRONG&gt;Enzyme&lt;/STRONG&gt;&lt;SPAN&gt;, Databricks' incremental computation engine: instead of overwriting all data that matches the predicate, the engine only rewrites the files with rows that changed since the last update. With Enzyme, refreshes are both faster and cheaper than a naive partition overwrite – over &lt;/SPAN&gt;&lt;STRONG&gt;3.4x &lt;/STRONG&gt;&lt;SPAN&gt;faster and&lt;/SPAN&gt;&lt;STRONG&gt; 2.5x &lt;/STRONG&gt;&lt;SPAN&gt;cheaper in an incremental TPC-DI benchmark.&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt; is available now in Beta — and it is available in both Spark Lakeflow Spark Declarative Pipelines and &lt;/SPAN&gt;&lt;SPAN&gt;DBSQL. You can try it today in the&lt;/SPAN&gt;&lt;SPAN&gt; Databricks SQL Query Editor, running directly on your SQL warehouse.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fran_martin_0-1781538236592.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27815i11FCD1C53F0766A6/image-size/large?v=v2&amp;amp;px=999" role="button" title="fran_martin_0-1781538236592.png" alt="fran_martin_0-1781538236592.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;STRONG&gt;The problems incremental REPLACE WHERE solves&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;The standing predicate — a condition baked into the pipeline definition that tells every scheduled run &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;"this is the slice you own"&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; — is the everyday case. It solves a few types of problems naturally, all the same shape: a slice you can name upfront that is the same every run.&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Rolling refresh.&lt;/STRONG&gt;&lt;SPAN&gt; A source continuously adds new rows and occasionally corrects older ones. You want each nightly run to refresh the last 7 days so corrections or late-arriving data get absorbed, and everything older — which is stable — left untouched. A standing &lt;CODE&gt;REPLACE_WHERE&amp;nbsp;order_date &amp;gt;= date_add(current_date(), -7)&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;becomes the pipeline's contract: every run rebuilds the trailing 7-day window and leaves the rest of the table as-is.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fran_martin_1-1781531405652.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27809iFFAD907CC0342A6D/image-size/large?v=v2&amp;amp;px=999" role="button" title="fran_martin_1-1781531405652.png" alt="fran_martin_1-1781531405652.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Building lasting tables from short-retention sources.&lt;/STRONG&gt;&lt;SPAN&gt; Your raw events table keeps only a few days of data, while the daily aggregates you derive from it need to live permanently. A standing &lt;CODE&gt;REPLACE_WHERE&amp;nbsp;event_date &amp;gt;= date_add(current_date(), -3)&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;rebuilds the recent three days each run, and older days persist. A Materialized View would remove rows where the event date is older than 3, since the source no longer contains them.&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fran_martin_2-1781531405653.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27810i5810A4C30D0C4107/image-size/large?v=v2&amp;amp;px=999" role="button" title="fran_martin_2-1781531405653.png" alt="fran_martin_2-1781531405653.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Scoping dimension updates to recent facts only.&lt;/STRONG&gt;&lt;SPAN&gt; You enrich a fact table by joining a dimension and writing the dimension value into the fact row at processing time. When the dimension changes later (a user moves regions, a product gets reclassified), you don't want every historical fact row rewritten — purely a &lt;/SPAN&gt;&lt;STRONG&gt;cost&lt;/STRONG&gt;&lt;SPAN&gt; decision: rewriting five years of facts every time a dimension row updates is expensive. A standing &lt;CODE&gt;REPLACE_WHERE&amp;nbsp;order_date &amp;gt;= date_add(current_date(), -3)&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;scopes each refresh to the recent window, so only recent facts pick up the current dimension values, while older facts retain the values they were enriched with originally.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fran_martin_3-1781531405653.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27808iC89CEC049033F03E/image-size/large?v=v2&amp;amp;px=999" role="button" title="fran_martin_3-1781531405653.png" alt="fran_martin_3-1781531405653.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;What these share: the slice each run owns is knowable up front and the same every run. You declare it once in the pipeline definition, and every scheduled update honors it.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;
&lt;H2&gt;&lt;STRONG&gt;Deep dive into REPLACE WHERE&lt;/STRONG&gt;&lt;/H2&gt;
&lt;H3&gt;&lt;STRONG&gt;1. Syntax overview&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;The syntax to use &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;is simple; the &lt;/SPAN&gt;&lt;STRONG&gt;predicate is the table's contract, not part of the &lt;SPAN&gt;&lt;CODE&gt;SELECT&lt;/CODE&gt;&lt;/SPAN&gt;.&lt;/STRONG&gt;&lt;SPAN&gt; You declare which slice this table owns once, at the top of the flow. SDP applies the predicate automatically — don't repeat it inside the body, and don't filter the source; SDP only reads the rows that match.&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;CREATE STREAMING TABLE silver_orders
FLOW REPLACE WHERE order_date &amp;gt;= date_add(current_date(), -7) BY NAME
SELECT
  order_id, order_date, region, product_id, customer_id,
  quantity, unit_price,
  CAST(quantity * unit_price AS DECIMAL(12,2)) AS revenue,
  _ingested_at
FROM bronze_orders;&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN&gt;To make this concrete, here's a self-contained three-layer example that you can copy and paste into a pipeline:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;I&gt;Source Data Generation:&lt;/I&gt;&lt;/STRONG&gt; &lt;I&gt;&lt;SPAN&gt;Run it into the SQL editor to generate source data&lt;/SPAN&gt;&lt;/I&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;CREATE SCHEMA IF NOT EXISTS &amp;lt;catalog&amp;gt;.&amp;lt;schema&amp;gt;;

USE CATALOG &amp;lt;catalog&amp;gt;;
USE SCHEMA &amp;lt;schema&amp;gt;;

CREATE TABLE IF NOT EXISTS bronze_orders (
  order_id STRING, order_date DATE, customer_id STRING, product_id STRING, region STRING,
  customer_segment STRING, customer_name STRING, product_name STRING, product_category STRING,
  list_price DECIMAL(10,2), quantity INT, unit_price DECIMAL(10,2),
  event_ts TIMESTAMP, _seq BIGINT, _ingested_at TIMESTAMP
)
CLUSTER BY (order_date)
TBLPROPERTIES (delta.enableRowTracking = true);

INSERT INTO bronze_orders
SELECT
  concat('ORD-', id),
  date_sub(current_date(), CAST(id % 90 AS INT)),
  concat('CUST-', lpad(CAST(id % 500 AS STRING), 5, '0')),
  concat('SKU-',  lpad(CAST(id % 25  AS STRING), 3, '0')),
  element_at(array('AMER','EMEA','APAC','LATAM'), CAST(id % 4 AS INT) + 1),
  element_at(array('Bronze','Silver','Gold','Platinum'), CAST(((id % 4) + CAST(id / 5000 AS INT)) % 4 AS INT) + 1),
  concat('Customer ', CAST(id % 500 AS INT)),
  concat('Product ',  CAST(id % 25  AS INT)),
  element_at(array('Electronics','Apparel','Home','Grocery','Sports'), CAST(id % 5 AS INT) + 1),
  CAST(20 + (((id % 25) + CAST(id / 5000 AS INT)) % 10) * 5 AS DECIMAL(10,2)),
  CAST((id % 9) + 1 AS INT),
  CAST(10 + (id % 190) AS DECIMAL(10,2)),
  current_timestamp(), id, current_timestamp()
FROM range(0, 2000);&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;&lt;I&gt;02_silver.sql:&lt;/I&gt;&lt;/STRONG&gt; &lt;I&gt;Replace the last 7 days only&lt;/I&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;CREATE STREAMING TABLE silver_orders
CLUSTER BY (order_date)
FLOW REPLACE WHERE order_date &amp;gt;= date_add(current_date(), -7) BY NAME
SELECT
  order_id,
  order_date,
  customer_id,
  product_id,
  region,
  quantity,
  unit_price,
  CAST(quantity * unit_price AS DECIMAL(12,2)) AS revenue,
  event_ts,
  _ingested_at
FROM bronze_orders;
&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;I&gt;03_gold.sql:&lt;/I&gt;&lt;/STRONG&gt;&lt;I&gt;&lt;SPAN&gt; Combines &lt;CODE&gt;AUTO CDC&lt;/CODE&gt;&amp;nbsp;for dimensions and &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&amp;nbsp;for the fact table&lt;/SPAN&gt;&lt;/I&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;CREATE OR REFRESH STREAMING TABLE dim_customer
COMMENT "SCD Type 1 customer dimension (AUTO CDC)."
FLOW AUTO CDC
FROM STREAM bronze_orders
KEYS (customer_id)
SEQUENCE BY _seq
COLUMNS customer_id, region, customer_segment, customer_name
STORED AS SCD TYPE 1;

CREATE OR REFRESH STREAMING TABLE dim_product
COMMENT "SCD Type 1 product dimension (AUTO CDC)."
FLOW AUTO CDC
FROM STREAM bronze_orders
KEYS (product_id)
SEQUENCE BY _seq
COLUMNS product_id, product_name, product_category, list_price
STORED AS SCD TYPE 1;

CREATE STREAMING TABLE daily_sales
CLUSTER BY (order_date)
FLOW REPLACE WHERE order_date &amp;gt;= date_add(current_date(), -7) BY NAME
SELECT
  order_date,
  customer_id,
  product_id,
  COUNT(*)                            AS order_count,
  SUM(quantity)                       AS units_sold,
  CAST(SUM(revenue) AS DECIMAL(18,2)) AS revenue_usd
FROM silver_orders
GROUP BY ALL;&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN&gt;As you can see, we created a&lt;/SPAN&gt;&lt;STRONG&gt; bronze table &lt;/STRONG&gt;&lt;SPAN&gt;that seeds orders inline (no external source needed), a&lt;/SPAN&gt;&lt;STRONG&gt; silver streaming table&lt;/STRONG&gt;&lt;SPAN&gt; fed by a &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;flow that refreshes only the last 7 days&lt;/SPAN&gt; &lt;SPAN&gt;on every update, and the downstream &lt;/SPAN&gt;&lt;STRONG&gt;gold dimensions and fact table&lt;/STRONG&gt;&lt;SPAN&gt;. In the gold layer, you can also see how &lt;CODE&gt;AUTO CDC&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp; and &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;both can be used in the same pipeline&lt;/SPAN&gt; &lt;SPAN&gt;for the creation of the dimensions and the fact table.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You can learn more about &lt;CODE&gt;AUTO CDC&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;in our previous blogpost:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/from-150-lines-of-merge-into-to-7-lines-of-sql-auto-cdc-comes-to/ba-p/155355" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://community.databricks.com/t5/technical-blog/from-150-lines-of-merge-into-to-7-lines-of-sql-auto-cdc-comes-to/ba-p/155355&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;
&lt;H3&gt;&lt;STRONG&gt;2. The magic under the hood: the Enzyme optimizer&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P&gt;&lt;SPAN&gt;This is what makes a &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;flow inside SDP different from the same operation written by hand in Spark.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;First, the baseline.&lt;/STRONG&gt;&lt;SPAN&gt; Every &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;update commits as a single Delta transaction that, mechanically: (1) deletes the rows matching the predicate, (2) recomputes the source query for the predicate range — the predicate is pushed down to the source read, so rows outside the window are never scanned — and (3) inserts the recomputed rows. That alone gives you two guarantees for free: the update is &lt;/SPAN&gt;&lt;STRONG&gt;atomic and recoverable&lt;/STRONG&gt;&lt;SPAN&gt; (readers never see a partial state; a failed run leaves the last good state intact), and it is &lt;/SPAN&gt;&lt;STRONG&gt;predicate-scoped&lt;/STRONG&gt;&lt;SPAN&gt; (rows outside the window are never touched — no file rewrite, no version churn).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Now compare this to doing it yourself in DBSQL.&lt;/STRONG&gt;&lt;SPAN&gt; Outside SDP, you would do something like this:&lt;/SPAN&gt;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;INSERT INTO TABLE events REPLACE WHERE start_date &amp;gt;= (NOW() - INTERVAL 7 DAY) SELECT * FROM replace_data&lt;/LI-CODE&gt;
&lt;P&gt;&lt;SPAN&gt;This statement &lt;/SPAN&gt;&lt;STRONG&gt;always rewrites everything that matches the predicate&lt;/STRONG&gt;&lt;SPAN&gt;, on every run, whether or not the underlying data changed. Tell it "the last 7 days," and it recomputes and rewrites all seven days, full stop — full source recompute over the window plus a full file rewrite. Dynamic partition overwrite also only works at &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;partition&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; granularity and requires the table to be physically partitioned the right way.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;This is where Enzyme comes in.&lt;/STRONG&gt;&lt;SPAN&gt; The engine doesn't stop at "overwrite everything that matches." Enzyme inspects the data matching the predicate, recognizes that most of it hasn't changed since the last run, and only rewrites the files that contain rows that actually changed. Same one-line predicate, a fraction of the I/O.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="fran_martin_4-1781531405654.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27811i4295E441D4486E7D/image-size/large?v=v2&amp;amp;px=999" role="button" title="fran_martin_4-1781531405654.png" alt="fran_martin_4-1781531405654.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;And it’s a lot faster and cheaper too. &lt;/STRONG&gt;&lt;SPAN&gt;On an incremental TPC-DI benchmark, the incremental &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt; ran &lt;/SPAN&gt;&lt;STRONG&gt;3.4x &lt;/STRONG&gt;&lt;SPAN&gt;faster and&lt;/SPAN&gt;&lt;STRONG&gt; 2.5x &lt;/STRONG&gt;&lt;SPAN&gt;cheaper than a full &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;, with identical output.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;The difference is what happens inside the predicate. A regular &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;overwrites everything that matches the predicate, so a 7-day window means all seven days get rewritten on every run, even the rows that didn't change. The incremental version scopes the work further: within that same window, it rewrites &lt;/SPAN&gt;&lt;STRONG&gt;only the rows that actually changed&lt;/STRONG&gt;&lt;SPAN&gt; and leaves the rest in place. You get the same result, but you stop paying to overwrite data that was already correct.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;When incremental processing engages.&lt;/STRONG&gt;&lt;SPAN&gt; Enzyme isn't guaranteed on every run — and that's fine, the output is always correct either way; the only difference is how much work it takes. It engages when all of these hold: the &lt;/SPAN&gt;&lt;STRONG&gt;query shape is supported&lt;/STRONG&gt;&lt;SPAN&gt;; the predicate references &lt;/SPAN&gt;&lt;STRONG&gt;base columns&lt;/STRONG&gt;&lt;SPAN&gt; (Predicates on derived values, such as aggregate or window function outputs disables incremental refresh); the predicate is &lt;/SPAN&gt;&lt;STRONG&gt;deterministic&lt;/STRONG&gt;&lt;SPAN&gt; (temporal functions such as &lt;/SPAN&gt;&lt;SPAN&gt;current_date()&lt;/SPAN&gt;&lt;SPAN&gt; are allowed, &lt;/SPAN&gt;&lt;SPAN&gt;rand()&lt;/SPAN&gt;&lt;SPAN&gt; is not); &lt;/SPAN&gt;&lt;STRONG&gt;no external DML&lt;/STRONG&gt;&lt;SPAN&gt; modified the window since the last update; and the window doesn't extend into rows the previous predicate excluded. About the type of computation to use, it will just work&lt;/SPAN&gt;&lt;STRONG&gt; if you use DBSQL, &lt;/STRONG&gt;&lt;SPAN&gt;but make sure you use &lt;/SPAN&gt;&lt;STRONG&gt;serverless&lt;/STRONG&gt;&lt;SPAN&gt; compute for &lt;/SPAN&gt;&lt;STRONG&gt;Spark Declarative Pipelines&lt;/STRONG&gt;&lt;SPAN&gt;. Finally, it is worth mentioning that the first refresh is always a full computation; incremental begins on the second update.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Diagnosing a fallback.&lt;/STRONG&gt;&lt;SPAN&gt; If a condition isn't met, the engine falls back to a full recompute of the window (correct, just heavier) and records why in the pipeline event log's &lt;/SPAN&gt;&lt;SPAN&gt;planning_information&lt;/SPAN&gt;&lt;SPAN&gt;:&lt;/SPAN&gt;&lt;/P&gt;
&lt;TABLE&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;Reason code&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;STRONG&gt;What happened&lt;/STRONG&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;EXTERNAL_CHANGE_IN_REPLACE_WINDOW&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;Something outside the pipeline modified rows inside the window since the last update.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;REPLACE_WHERE_NOT_DETERMINISTIC&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The current predicate uses non-deterministic expressions.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;PRIOR_REPLACE_WHERE_NOT_DETERMINISTIC&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The previous run's predicate wasn't deterministic.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;UNSUPPORTED_REPLACE_WHERE_PREDICATE&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The query shape or predicate isn't on the fast path (or this was a predicate override).&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;PRIOR_REPLACE_WHERE_PREDICATE_MISSING&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The previous predicate is missing from the provenance.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;STALE_TARGET_DATA_IN_WINDOW&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The replace where predicated changed across updates, and the new window includes rows that were excluded by the previous predicate.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;CANNOT_PUSH_PREDICATE&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The replace where predicate could not be pushed down to at least one source.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;REPLACE_WHERE_BACKFILL&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;TD&gt;
&lt;P&gt;&lt;SPAN&gt;The update is a backfill — a one-off predicate override used to populate historical data.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;&lt;SPAN&gt;To stay on the fast path: predicate on base columns with a moving lower bound; include the predicate column in the &lt;CODE&gt;GROUP BY&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;for aggregations and in the join key for joins; and don't write to the target from outside the pipeline inside the replace window.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;
&lt;H2&gt;&lt;STRONG&gt;The predicate override: a one-time update for backfills and corrections&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Most of the time, the pipeline runs normally, but occasionally a single update needs a different scope. A few common examples:&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI aria-level="1"&gt;
&lt;P&gt;&lt;STRONG&gt;Targeted corrections. &lt;/STRONG&gt;&lt;SPAN&gt;A conversion bug caused revenue for &lt;CODE&gt;country = 'CA'&lt;/CODE&gt;&amp;nbsp;to be incorrect over&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;the last 30 days. The standing 7-day predicate won't reach back that far, and you don't want to recompute everything else. A one-time update can rewrite just &lt;CODE&gt;country = 'CA'&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;for that window, then the pipeline returns to its standing predicate.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;LI-WRAPPER&gt;&lt;/LI-WRAPPER&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;
&lt;P&gt;&lt;STRONG&gt;Backfills: &lt;/STRONG&gt;&lt;SPAN&gt;You add a new metric to a daily aggregate or change how a column handles an edge case. From now on, every run uses the new logic — but you also want it applied to a chosen historical range. Or: you keep the standing window small while validating the transformation, then load the full history once you're confident.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;Both are &lt;/SPAN&gt;&lt;STRONG&gt;predicate overrides&lt;/STRONG&gt;&lt;SPAN&gt; — a one-time run where the standing predicate is swapped for a different expression, for that update only. The pipeline definition remains untouched; the next scheduled run reverts to the standing predicate.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In Spark Declarative Pipelines, &lt;/STRONG&gt;&lt;SPAN&gt;the override currently runs through the API. The docs include a helper function with examples:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;STRONG&gt;In DBSQL, &lt;CODE&gt;REFRESH&lt;/CODE&gt;&lt;/STRONG&gt;&amp;nbsp;overrides are coming soon; in the meantime,&amp;nbsp;&lt;STRONG&gt;arbitrary DML&amp;nbsp;&lt;/STRONG&gt;is supported: you can &lt;CODE&gt;INSERT&lt;/CODE&gt;, &lt;CODE&gt;SELECT&lt;/CODE&gt;, or &lt;CODE&gt;DELETE&lt;/CODE&gt; against the table from&amp;nbsp;&lt;EM&gt;outside&lt;/EM&gt;&amp;nbsp;the pipeline, like for any other streaming table.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;
&lt;H2&gt;&lt;STRONG&gt;How to get started&lt;/STRONG&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;flows are in &lt;/SPAN&gt;&lt;STRONG&gt;Beta&lt;/STRONG&gt;&lt;SPAN&gt;. To wire one up: set your pipeline channel to &lt;/SPAN&gt;&lt;SPAN&gt;PREVIEW&lt;/SPAN&gt;&lt;SPAN&gt;, use Unity Catalog, and run on &lt;/SPAN&gt;&lt;STRONG&gt;DBSQL&lt;/STRONG&gt;&lt;SPAN&gt; or &lt;/SPAN&gt;&lt;STRONG&gt;serverless compute&lt;/STRONG&gt;&lt;SPAN&gt; (required for incremental processing via Enzyme).&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;A good way to start is to take the self-contained sample provided above, copy it to a new Spark Declartive Pipeline, and run it.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;A few limitations to know:&lt;/STRONG&gt;&lt;SPAN&gt; A table can have only one &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;flow (and can't also be an &lt;CODE&gt;AUTO CDC&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;or append target); expectations aren't supported on &lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;targets (set them upstream); and &lt;CODE&gt;FULL REFRESH&lt;/CODE&gt;&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;is destructive — it re-runs the flow with its &lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt;standing&lt;/SPAN&gt;&lt;/I&gt;&lt;SPAN&gt; predicate, so any backfilled history is lost (set &lt;/SPAN&gt;&lt;SPAN&gt;pipelines.reset.allowed = false&lt;/SPAN&gt;&lt;SPAN&gt; to block it).&lt;/SPAN&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;REPLACE WHERE in DBSQL &lt;/STRONG&gt;&lt;SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/dbsql/flows-replace-where" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/ldp/dbsql/flows-replace-where&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;REPLACE WHERE flows for Lakeflow Declarative Pipelines&lt;/STRONG&gt;&lt;SPAN&gt; — &lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/flows-replace-where" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://docs.databricks.com/aws/en/ldp/flows-replace-where&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Backfilling historical data&lt;/STRONG&gt;&lt;SPAN&gt; — &lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Incremental refresh &amp;amp; supported query shapes&lt;/STRONG&gt;&lt;SPAN&gt; — &lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/flows-replace-where#best-practices-for-incremental-refresh" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://docs.databricks.com/aws/en/ldp/flows-replace-where#best-practices-for-incremental-refresh&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="font-weight: 400;" aria-level="1"&gt;&lt;STRONG&gt;Limitations &lt;/STRONG&gt;&lt;SPAN&gt;— &lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/ldp/flows-replace-where#limitations" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;https://docs.databricks.com/aws/en/ldp/flows-replace-where#limitations&lt;/SPAN&gt;&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN&gt;If you have feedback or questions, drop them in the comments below.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 16 Jun 2026 16:44:13 GMT</pubDate>
    <dc:creator>fran_martin</dc:creator>
    <dc:date>2026-06-16T16:44:13Z</dc:date>
    <item>
      <title>Incremental REPLACE WHERE Flows Brings Targeted Refreshes to SDP and DBSQL</title>
      <link>https://community.databricks.com/t5/technical-blog/incremental-replace-where-flows-brings-targeted-refreshes-to-sdp/ba-p/159057</link>
      <description>&lt;P&gt;SDP has always given you two ways to keep a table fresh: a streaming flow for incremental ingestion, or a materialized view that recomputes an exact result. But most batch ETL teams live in a third world entirely: "refresh just the last 7 days, leave everything else alone." Incremental &lt;SPAN&gt;&lt;CODE&gt;REPLACE WHERE&lt;/CODE&gt;&lt;/SPAN&gt; flows finally make that a first-class pattern, with Enzyme doing the incremental math, so refreshes run over 3.4x faster and 2.5x cheaper than a naive partition overwrite.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jun 2026 16:44:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/technical-blog/incremental-replace-where-flows-brings-targeted-refreshes-to-sdp/ba-p/159057</guid>
      <dc:creator>fran_martin</dc:creator>
      <dc:date>2026-06-16T16:44:13Z</dc:date>
    </item>
  </channel>
</rss>

