Databricks Community

fran_martin

Lakeflow Spark Declarative Pipelines (SDP) has long supported two ways to maintain a table: a streaming flow for incremental ingestion and a materialized view for a fully derived result that is always exact. Both are great fits for the shapes they were designed for. What's been missing is a third pattern, the one many batch ETL teams already work in: "on this scheduled run, refresh just the last 7 days; leave the rest of the table alone."

That's what incremental REPLACE WHERE flows bring to SDP and DBSQL — a third option, a batch processing flow with automatic incrementalization. You declare the slice each run owns with a single predicate, and the pipeline refreshes exactly that slice on every update. The same primitive also supports a one-time predicate override, so the occasional fix-a-column-and-backfill case uses the same mechanism, with no need to recompute the full table.

And it's more than the REPLACE WHERE you'd write in Databricks SQL. Incremental REPLACE WHERE flows are powered by Enzyme, Databricks' incremental computation engine: instead of overwriting all data that matches the predicate, the engine only rewrites the files with rows that changed since the last update. With Enzyme, refreshes are both faster and cheaper than a naive partition overwrite – over 3.4x faster and 2.5x cheaper in an incremental TPC-DI benchmark.

REPLACE WHERE is available now in Beta — and it is available in both Spark Lakeflow Spark Declarative Pipelines and DBSQL. You can try it today in the Databricks SQL Query Editor, running directly on your SQL warehouse.

The problems incremental REPLACE WHERE solves

The standing predicate — a condition baked into the pipeline definition that tells every scheduled run "this is the slice you own" — is the everyday case. It solves a few types of problems naturally, all the same shape: a slice you can name upfront that is the same every run.

Rolling refresh. A source continuously adds new rows and occasionally corrects older ones. You want each nightly run to refresh the last 7 days so corrections or late-arriving data get absorbed, and everything older — which is stable — left untouched. A standing REPLACE_WHERE order_date >= date_add(current_date(), -7) becomes the pipeline's contract: every run rebuilds the trailing 7-day window and leaves the rest of the table as-is.

Building lasting tables from short-retention sources. Your raw events table keeps only a few days of data, while the daily aggregates you derive from it need to live permanently. A standing REPLACE_WHERE event_date >= date_add(current_date(), -3) rebuilds the recent three days each run, and older days persist. A Materialized View would remove rows where the event date is older than 3, since the source no longer contains them.

Scoping dimension updates to recent facts only. You enrich a fact table by joining a dimension and writing the dimension value into the fact row at processing time. When the dimension changes later (a user moves regions, a product gets reclassified), you don't want every historical fact row rewritten — purely a cost decision: rewriting five years of facts every time a dimension row updates is expensive. A standing REPLACE_WHERE order_date >= date_add(current_date(), -3) scopes each refresh to the recent window, so only recent facts pick up the current dimension values, while older facts retain the values they were enriched with originally.

What these share: the slice each run owns is knowable up front and the same every run. You declare it once in the pipeline definition, and every scheduled update honors it.

Deep dive into REPLACE WHERE

1. Syntax overview

The syntax to use REPLACE WHERE is simple; the predicate is the table's contract, not part of the SELECT. You declare which slice this table owns once, at the top of the flow. SDP applies the predicate automatically — don't repeat it inside the body, and don't filter the source; SDP only reads the rows that match.

CREATE STREAMING TABLE silver_orders
FLOW REPLACE WHERE order_date >= date_add(current_date(), -7) BY NAME
SELECT
  order_id, order_date, region, product_id, customer_id,
  quantity, unit_price,
  CAST(quantity * unit_price AS DECIMAL(12,2)) AS revenue,
  _ingested_at
FROM bronze_orders;

To make this concrete, here's a self-contained three-layer example that you can copy and paste into a pipeline:

Source Data Generation: Run it into the SQL editor to generate source data

CREATE SCHEMA IF NOT EXISTS <catalog>.<schema>;

USE CATALOG <catalog>;
USE SCHEMA <schema>;

CREATE TABLE IF NOT EXISTS bronze_orders (
  order_id STRING, order_date DATE, customer_id STRING, product_id STRING, region STRING,
  customer_segment STRING, customer_name STRING, product_name STRING, product_category STRING,
  list_price DECIMAL(10,2), quantity INT, unit_price DECIMAL(10,2),
  event_ts TIMESTAMP, _seq BIGINT, _ingested_at TIMESTAMP
)
CLUSTER BY (order_date)
TBLPROPERTIES (delta.enableRowTracking = true);

INSERT INTO bronze_orders
SELECT
  concat('ORD-', id),
  date_sub(current_date(), CAST(id % 90 AS INT)),
  concat('CUST-', lpad(CAST(id % 500 AS STRING), 5, '0')),
  concat('SKU-',  lpad(CAST(id % 25  AS STRING), 3, '0')),
  element_at(array('AMER','EMEA','APAC','LATAM'), CAST(id % 4 AS INT) + 1),
  element_at(array('Bronze','Silver','Gold','Platinum'), CAST(((id % 4) + CAST(id / 5000 AS INT)) % 4 AS INT) + 1),
  concat('Customer ', CAST(id % 500 AS INT)),
  concat('Product ',  CAST(id % 25  AS INT)),
  element_at(array('Electronics','Apparel','Home','Grocery','Sports'), CAST(id % 5 AS INT) + 1),
  CAST(20 + (((id % 25) + CAST(id / 5000 AS INT)) % 10) * 5 AS DECIMAL(10,2)),
  CAST((id % 9) + 1 AS INT),
  CAST(10 + (id % 190) AS DECIMAL(10,2)),
  current_timestamp(), id, current_timestamp()
FROM range(0, 2000);

02_silver.sql: Replace the last 7 days only

CREATE STREAMING TABLE silver_orders
CLUSTER BY (order_date)
FLOW REPLACE WHERE order_date >= date_add(current_date(), -7) BY NAME
SELECT
  order_id,
  order_date,
  customer_id,
  product_id,
  region,
  quantity,
  unit_price,
  CAST(quantity * unit_price AS DECIMAL(12,2)) AS revenue,
  event_ts,
  _ingested_at
FROM bronze_orders;

03_gold.sql: Combines AUTO CDC for dimensions and REPLACE WHERE for the fact table

CREATE OR REFRESH STREAMING TABLE dim_customer
COMMENT "SCD Type 1 customer dimension (AUTO CDC)."
FLOW AUTO CDC
FROM STREAM bronze_orders
KEYS (customer_id)
SEQUENCE BY _seq
COLUMNS customer_id, region, customer_segment, customer_name
STORED AS SCD TYPE 1;

CREATE OR REFRESH STREAMING TABLE dim_product
COMMENT "SCD Type 1 product dimension (AUTO CDC)."
FLOW AUTO CDC
FROM STREAM bronze_orders
KEYS (product_id)
SEQUENCE BY _seq
COLUMNS product_id, product_name, product_category, list_price
STORED AS SCD TYPE 1;

CREATE STREAMING TABLE daily_sales
CLUSTER BY (order_date)
FLOW REPLACE WHERE order_date >= date_add(current_date(), -7) BY NAME
SELECT
  order_date,
  customer_id,
  product_id,
  COUNT(*)                            AS order_count,
  SUM(quantity)                       AS units_sold,
  CAST(SUM(revenue) AS DECIMAL(18,2)) AS revenue_usd
FROM silver_orders
GROUP BY ALL;

As you can see, we created a bronze table that seeds orders inline (no external source needed), a silver streaming table fed by a REPLACE WHERE flow that refreshes only the last 7 days on every update, and the downstream gold dimensions and fact table. In the gold layer, you can also see how AUTO CDC and REPLACE WHERE both can be used in the same pipeline for the creation of the dimensions and the fact table.

You can learn more about AUTO CDC in our previous blogpost: https://community.databricks.com/t5/technical-blog/from-150-lines-of-merge-into-to-7-lines-of-sql-au...

2. The magic under the hood: the Enzyme optimizer

This is what makes a REPLACE WHERE flow inside SDP different from the same operation written by hand in Spark.

First, the baseline. Every REPLACE WHERE update commits as a single Delta transaction that, mechanically: (1) deletes the rows matching the predicate, (2) recomputes the source query for the predicate range — the predicate is pushed down to the source read, so rows outside the window are never scanned — and (3) inserts the recomputed rows. That alone gives you two guarantees for free: the update is atomic and recoverable (readers never see a partial state; a failed run leaves the last good state intact), and it is predicate-scoped (rows outside the window are never touched — no file rewrite, no version churn).

Now compare this to doing it yourself in DBSQL. Outside SDP, you would do something like this:

INSERT INTO TABLE events REPLACE WHERE start_date >= (NOW() - INTERVAL 7 DAY) SELECT * FROM replace_data

This statement always rewrites everything that matches the predicate, on every run, whether or not the underlying data changed. Tell it "the last 7 days," and it recomputes and rewrites all seven days, full stop — full source recompute over the window plus a full file rewrite. Dynamic partition overwrite also only works at partition granularity and requires the table to be physically partitioned the right way.

This is where Enzyme comes in. The engine doesn't stop at "overwrite everything that matches." Enzyme inspects the data matching the predicate, recognizes that most of it hasn't changed since the last run, and only rewrites the files that contain rows that actually changed. Same one-line predicate, a fraction of the I/O.

And it’s a lot faster and cheaper too. On an incremental TPC-DI benchmark, the incremental REPLACE WHERE ran 3.4x faster and 2.5x cheaper than a full REPLACE WHERE, with identical output.

The difference is what happens inside the predicate. A regular REPLACE WHERE overwrites everything that matches the predicate, so a 7-day window means all seven days get rewritten on every run, even the rows that didn't change. The incremental version scopes the work further: within that same window, it rewrites only the rows that actually changed and leaves the rest in place. You get the same result, but you stop paying to overwrite data that was already correct.

When incremental processing engages. Enzyme isn't guaranteed on every run — and that's fine, the output is always correct either way; the only difference is how much work it takes. It engages when all of these hold: the query shape is supported; the predicate references base columns (Predicates on derived values, such as aggregate or window function outputs disables incremental refresh); the predicate is deterministic (temporal functions such as current_date() are allowed, rand() is not); no external DML modified the window since the last update; and the window doesn't extend into rows the previous predicate excluded. About the type of computation to use, it will just work if you use DBSQL, but make sure you use serverless compute for Spark Declarative Pipelines. Finally, it is worth mentioning that the first refresh is always a full computation; incremental begins on the second update.

Diagnosing a fallback. If a condition isn't met, the engine falls back to a full recompute of the window (correct, just heavier) and records why in the pipeline event log's planning_information:

Reason code	What happened
EXTERNAL_CHANGE_IN_REPLACE_WINDOW	Something outside the pipeline modified rows inside the window since the last update.
REPLACE_WHERE_NOT_DETERMINISTIC	The current predicate uses non-deterministic expressions.
PRIOR_REPLACE_WHERE_NOT_DETERMINISTIC	The previous run's predicate wasn't deterministic.
UNSUPPORTED_REPLACE_WHERE_PREDICATE	The query shape or predicate isn't on the fast path (or this was a predicate override).
PRIOR_REPLACE_WHERE_PREDICATE_MISSING	The previous predicate is missing from the provenance.
STALE_TARGET_DATA_IN_WINDOW	The replace where predicated changed across updates, and the new window includes rows that were excluded by the previous predicate.
CANNOT_PUSH_PREDICATE	The replace where predicate could not be pushed down to at least one source.
REPLACE_WHERE_BACKFILL	The update is a backfill — a one-off predicate override used to populate historical data.

To stay on the fast path: predicate on base columns with a moving lower bound; include the predicate column in the GROUP BY for aggregations and in the join key for joins; and don't write to the target from outside the pipeline inside the replace window.

The predicate override: a one-time update for backfills and corrections

Most of the time, the pipeline runs normally, but occasionally a single update needs a different scope. A few common examples:

Targeted corrections. A conversion bug caused revenue for country = 'CA' to be incorrect over the last 30 days. The standing 7-day predicate won't reach back that far, and you don't want to recompute everything else. A one-time update can rewrite just country = 'CA' for that window, then the pipeline returns to its standing predicate.

Backfills: You add a new metric to a daily aggregate or change how a column handles an edge case. From now on, every run uses the new logic — but you also want it applied to a chosen historical range. Or: you keep the standing window small while validating the transformation, then load the full history once you're confident.

Both are predicate overrides — a one-time run where the standing predicate is swapped for a different expression, for that update only. The pipeline definition remains untouched; the next scheduled run reverts to the standing predicate.

In Spark Declarative Pipelines, the override currently runs through the API. The docs include a helper function with examples: https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data

In DBSQL, REFRESH overrides are coming soon; in the meantime, arbitrary DML is supported: you can INSERT, SELECT, or DELETE against the table from outside the pipeline, like for any other streaming table.

How to get started

REPLACE WHERE flows are in Beta. To wire one up: set your pipeline channel to PREVIEW, use Unity Catalog, and run on DBSQL or serverless compute (required for incremental processing via Enzyme).

A good way to start is to take the self-contained sample provided above, copy it to a new Spark Declartive Pipeline, and run it.

A few limitations to know: A table can have only one REPLACE WHERE flow (and can't also be an AUTO CDC or append target); expectations aren't supported on REPLACE WHERE targets (set them upstream); and FULL REFRESH is destructive — it re-runs the flow with its standing predicate, so any backfilled history is lost (set pipelines.reset.allowed = false to block it).

REPLACE WHERE in DBSQL https://docs.databricks.com/aws/en/ldp/dbsql/flows-replace-where
REPLACE WHERE flows for Lakeflow Declarative Pipelines — https://docs.databricks.com/aws/en/ldp/flows-replace-where
Backfilling historical data — https://docs.databricks.com/aws/en/ldp/flows-replace-where#backfill-historical-data
Incremental refresh & supported query shapes — https://docs.databricks.com/aws/en/ldp/flows-replace-where#best-practices-for-incremental-refresh
Limitations — https://docs.databricks.com/aws/en/ldp/flows-replace-where#limitations

If you have feedback or questions, drop them in the comments below.

Databricks Community

Incremental REPLACE WHERE Flows Brings Targeted Refreshes to SDP and DBSQL

The problems incremental REPLACE WHERE solves

Deep dive into REPLACE WHERE

1. Syntax overview

2. The magic under the hood: the Enzyme optimizer

The predicate override: a one-time update for backfills and corrections

How to get started

Metadata-Driven ETL Framework in Databricks (Part-1)

Top 10 query performance tuning tips for Databricks Serverless SQL

Best practices for safe data experimentation with Databricks