Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks

dfighter1312
New Contributor

I'm working with vector search in Databricks using vector index sync with Delta tables, and I'm a bit unclear on how updates to the source table affect the vector index, specifically when using different write operations.

If I overwrite the source Delta table that is synced to the vector index (using the overwrite mode), will all the embeddings be recalculated and the vector index fully refreshed?

On the other hand, if I use a MERGE operation to upsert data into the source table, does the sync behave differently? For instance, are only the updated or inserted rows recalculated and synced?

Since we are using Azure OpenAI's embedding models for a large number of documents, fully recalculating the embeddings would be costly. Source Delta tables must have Change Data Feed enabled, so I assume embedding updates can be based on the table's change details.

Thanks in advance!

Darwin
1 REPLY

mark_ott
Databricks Employee

Overwriting a Delta table versus using a MERGE operation has different impacts on Databricks vector index sync, especially when Change Data Feed (CDF) is enabled and your embeddings are generated via Azure OpenAI models.

Overwrite Mode

When you overwrite a Delta table that is synced to a vector index, every row in the table is replaced, so Databricks recomputes embeddings for all records in the vector index. The overwrite makes the previous state of the table irrelevant; the new contents become the sole source of truth. The sync therefore refreshes the entire vector index and recalculates embeddings for every document in the Delta table, including unchanged ones, which can be very costly with large datasets and expensive embedding models.

MERGE Operation

The MERGE operation (also known as upsert) behaves much more efficiently with vector index sync, especially when Change Data Feed is enabled:

  • MERGE makes targeted changes: new records are inserted, and existing ones are updated or deleted.

  • With CDF enabled on your Delta table, Databricks can track exactly which rows were inserted, updated, or deleted.

  • The vector index sync process only recalculates embeddings for those specific changed rows.

  • Unchanged rows will not have their embeddings recomputed, which minimizes unnecessary calls to the embedding API and controls costs.

This approach is both cost- and performance-optimal for large-scale applications where document updates are regular and only a subset of the corpus changes between syncs.
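
To make the cost difference concrete, here is a toy model in plain Python. It is not Databricks code: it only counts which rows would need new embeddings under the behavior described above. The function names and sample data are hypothetical, and the MERGE case assumes the merge only rewrites rows whose text actually differs.

```python
def rows_to_embed_after_overwrite(old, new):
    """Overwrite replaces the whole table, so every row in the
    new snapshot is treated as changed and gets re-embedded."""
    return sorted(new)


def rows_to_embed_after_merge(old, new):
    """MERGE (assumed to touch only new or modified rows) leaves
    unchanged rows alone, so only inserts and real updates count."""
    return sorted(key for key, text in new.items() if old.get(key) != text)


# Hypothetical table contents: id -> document text
old = {1: "alpha", 2: "beta", 3: "gamma"}
new = {1: "alpha", 2: "beta v2", 3: "gamma", 4: "delta"}

overwrite_ids = rows_to_embed_after_overwrite(old, new)
merge_ids = rows_to_embed_after_merge(old, new)

print("overwrite re-embeds:", overwrite_ids)  # all four rows
print("merge re-embeds:", merge_ids)          # only rows 2 and 4
```

With a real corpus the gap scales the same way: the overwrite cost follows the table size, while the MERGE cost follows the number of rows that changed.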

Why Delta CDF Matters

With Change Data Feed enabled, Databricks can identify row-level changes for each transaction. The sync process uses this information to process only the changed rows (inserts, updates, deletes) when recalculating embeddings and updating the index. This preserves performance and reduces OpenAI API charges, because embeddings are recomputed only for rows that actually changed.
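
As a rough sketch, these are the SQL statements you could use to turn on Change Data Feed and to inspect which rows a write actually changed. The helper names and the table name `main.docs.chunks` are hypothetical. The sync reads the change feed on its own; the second query is only a way to check for yourself what a given write produced.

```python
def enable_cdf_sql(table):
    """SQL that enables Change Data Feed on an existing Delta table."""
    return (
        f"ALTER TABLE {table} "
        "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
    )


def changed_rows_sql(table, start_version):
    """SQL that lists row-level changes since a given table version.
    update_preimage rows are skipped because they hold the old values."""
    return (
        f"SELECT * FROM table_changes('{table}', {start_version}) "
        "WHERE _change_type IN ('insert', 'update_postimage', 'delete')"
    )


# Hypothetical table name, used only for illustration
table = "main.docs.chunks"
print(enable_cdf_sql(table))
print(changed_rows_sql(table, 5))

# In a Databricks notebook you would pass each string to spark.sql(...)
```

Running the second query after an overwrite and again after a MERGE is a quick way to see how many rows each write reports as changed.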

Recommendations

  • Avoid overwrite mode unless you intend to refresh the entire index and regenerate all embeddings.

  • Prefer MERGE/upsert operations when making incremental changes, since they let the sync use CDF to minimize embedding recomputation.

  • Always enable Change Data Feed for efficient, change-aware vector index syncing.
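
One detail worth watching with MERGE: a plain `WHEN MATCHED THEN UPDATE` rewrites every matched row, even when nothing changed, and those rows could then show up as updates. The sketch below builds a MERGE statement that only updates rows whose text differs. The helper name, table names, and column names are hypothetical, and it assumes the source and target share the same columns so that `UPDATE SET *` and `INSERT *` apply.

```python
def build_merge_sql(target, source, key, text_col):
    """Build a MERGE that inserts new rows and updates only rows whose
    text column differs. <=> is the null-safe equality operator, so
    NULL-to-NULL comparisons do not count as a change."""
    return "\n".join([
        f"MERGE INTO {target} AS t",
        f"USING {source} AS s",
        f"ON t.{key} = s.{key}",
        f"WHEN MATCHED AND NOT (t.{text_col} <=> s.{text_col}) THEN UPDATE SET *",
        "WHEN NOT MATCHED THEN INSERT *",
    ])


# Hypothetical names, used only for illustration
merge_sql = build_merge_sql(
    target="main.docs.chunks",
    source="main.docs.chunks_staging",
    key="id",
    text_col="text",
)
print(merge_sql)

# In a Databricks notebook you would run: spark.sql(merge_sql)
```

This version does not handle deletes; rows removed from the source would need a separate DELETE or an extra MERGE clause.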