Overwriting a Delta table and updating it with a MERGE operation affect Databricks vector index sync differently, especially when Change Data Feed (CDF) is enabled and the embeddings are generated by Azure OpenAI models.
Overwrite Mode
When you overwrite a Delta table that is synced to a vector index, every row in the table is replaced, so the next sync recomputes embeddings for all records in the index. The overwrite makes the previous state of the table irrelevant: the new contents become the sole source of truth. The sync therefore refreshes the entire vector index and recalculates embeddings for every document in the table, including unchanged ones, which can be very costly with large datasets and expensive embedding models.
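To make the cost difference concrete, here is a minimal pure-Python sketch that counts how many rows would need new embeddings under a full refresh and under a change-aware sync. It is only an illustration of the idea, not how Databricks implements sync internally; the function names and the sample rows are hypothetical.

```python
def rows_to_embed_full(new_rows: dict[str, str]) -> int:
    """Full refresh: every row in the new snapshot is re-embedded."""
    return len(new_rows)


def rows_to_embed_incremental(old_rows: dict[str, str], new_rows: dict[str, str]) -> int:
    """Change-aware sync: only inserted rows or rows whose text changed are re-embedded."""
    return sum(1 for key, text in new_rows.items() if old_rows.get(key) != text)


# Hypothetical snapshots keyed by document id.
old = {"a": "alpha", "b": "beta", "c": "gamma"}
new = {"a": "alpha", "b": "beta v2", "d": "delta"}

full_count = rows_to_embed_full(new)                     # all three rows
incremental_count = rows_to_embed_incremental(old, new)  # "b" changed, "d" inserted
print(full_count, incremental_count)
```

In this toy example the full refresh embeds three rows and the change-aware path embeds two; on a large corpus where only a small fraction changes, the gap grows accordingly.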
MERGE Operation
The MERGE operation (also known as upsert) works much more efficiently with vector index sync, especially when Change Data Feed is enabled:
- MERGE makes targeted changes: new records are inserted, and existing ones are updated or deleted.
- With CDF enabled on your Delta table, Databricks can track exactly which rows were inserted, updated, or deleted.
- The vector index sync process only recalculates embeddings for those specific changed rows.
- Unchanged rows will not have their embeddings recomputed, which minimizes unnecessary calls to the embedding API and controls costs.
This approach is both cost- and performance-optimal for large-scale applications where document updates are regular and only a subset of the corpus changes between syncs.
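A MERGE along these lines could be sketched as follows. The helper below only builds the SQL text; the table names, key column, and text column are hypothetical. The extra condition on the matched clause is an assumption on my part: it keeps rows whose text has not changed from being rewritten, so they should not show up as updates in the change feed.

```python
def build_merge_sql(target: str, source: str, key: str, text_col: str) -> str:
    """Build a Delta MERGE statement that updates changed rows and inserts new ones."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        # Update only when the text differs, so unchanged rows are left alone.
        f"WHEN MATCHED AND t.{text_col} <> s.{text_col} THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and column names, for illustration only.
merge_sql = build_merge_sql("main.docs.chunks", "main.docs.chunk_updates", "id", "content")
print(merge_sql)

# In a Databricks notebook, where `spark` is predefined, this could be run with:
# spark.sql(merge_sql)
```

Note that `<>` does not treat NULL values as different, so a nullable text column would need a null-safe comparison instead.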
Why Delta CDF Matters
With Change Data Feed enabled, Databricks can identify row-level changes for each transaction. The sync process uses this information to process only the changed rows (inserts, updates, deletes) when recalculating embeddings and updating the index. This preserves performance and reduces OpenAI API charges, because embeddings are recomputed only for rows that actually changed.
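As a rough sketch, the statements below turn on Change Data Feed for an existing table and summarize the row-level changes recorded since a given table version. The helpers only build SQL text; the table name and starting version are hypothetical, and the summary query is just one way to check how many rows a write actually touched.

```python
def enable_cdf_sql(table: str) -> str:
    """SQL that enables Change Data Feed on an existing Delta table."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"


def changes_since_sql(table: str, start_version: int) -> str:
    """SQL that counts change records per type since the given table version."""
    return (
        f"SELECT _change_type, COUNT(*) AS n "
        f"FROM table_changes('{table}', {start_version}) "
        f"GROUP BY _change_type"
    )


# Hypothetical table name and version, for illustration only.
print(enable_cdf_sql("main.docs.chunks"))
print(changes_since_sql("main.docs.chunks", 12))
```

The `_change_type` column distinguishes `insert`, `update_preimage`, `update_postimage`, and `delete` records, so comparing these counts after an overwrite and after a MERGE shows how much work each write hands to the sync.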
Recommendations
- Avoid overwrite mode unless you intend to refresh the entire index and regenerate all embeddings.
- Prefer MERGE/upsert operations when making incremental changes; this leverages CDF to minimize embedding recomputation.
- Always enable Change Data Feed for efficient, change-aware vector index syncing.