Databricks Community

dfighter1312 · ‎03-21-2025

I'm working with vector search in Databricks using vector index sync with Delta tables, and I'm a bit unclear on how updates to the source table affect the vector index, specifically when using different write operations.

If I overwrite the source Delta table that is synced to the vector index (using the overwrite mode), will all the embeddings be recalculated and the vector index fully refreshed?

On the other hand, if I use a MERGE operation to upsert data into the source table, does the sync behave differently? For instance, are only the updated or inserted rows recalculated and synced?

Since we are using Azure OpenAI's embedding models for a large number of documents, fully recalculating the embeddings would be quite costly. Source Delta tables must have Change Data Feed enabled, so I assume embedding updates can be based on the table's change details.

Thanks in advance!

Darwin

mark_ott · ‎11-07-2025

Overwriting a Delta table versus using a MERGE operation has different impacts on Databricks vector index sync, especially when Change Data Feed (CDF) is enabled and your embeddings are generated via Azure OpenAI models.

Overwrite Mode

When you overwrite a Delta table that is synced to a vector index, the default behavior is that all the rows in the table are replaced, and therefore Databricks will trigger a full recomputation of embeddings for all records in the vector index. This is because the overwrite operation essentially makes the previous state of the table irrelevant; the new contents become the sole source of truth. As a result, the sync process refreshes the entire vector index, recalculating embeddings for every document in the Delta table—even unchanged ones—which can be very costly if you are dealing with large datasets and expensive embedding models.

MERGE Operation

The MERGE operation (also known as upsert) behaves much more efficiently with vector index sync, especially when Change Data Feed is enabled:

MERGE makes targeted changes—new records are inserted, existing ones updated, or deleted.
With CDF enabled on your Delta table, Databricks can track exactly which rows were inserted, updated, or deleted.
The vector index sync process only recalculates embeddings for those specific changed rows.
Unchanged rows will not have their embeddings recomputed, which minimizes unnecessary calls to the embedding API and controls costs.

This approach is both cost- and performance-optimal for large-scale applications where document updates are regular and only a subset of the corpus changes between syncs.
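As a rough illustration of the upsert described above, here is a minimal sketch that only builds the MERGE statement as a string. The table names, key column, and compared columns are hypothetical placeholders, and the extra match condition is my own assumption about how to avoid rewriting rows whose content is identical. In a notebook you would pass the result to spark.sql(...).

```python
from typing import List


def build_merge_sql(target: str, source: str, key: str, compare_cols: List[str]) -> str:
    """Build a Delta MERGE that updates a matched row only if a compared column differs.

    All names passed in are hypothetical; adjust them to your own schema.
    """
    # <=> is Spark SQL's null-safe equality, so NULL vs NULL counts as "unchanged".
    differs = " OR ".join(f"NOT (t.{c} <=> s.{c})" for c in compare_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND ({differs}) THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_merge_sql("main.docs.chunks", "staged_chunks", "id", ["text", "title"]))
```

Without the extra condition, every matched row is rewritten even when nothing changed, so the change feed may report more updates than you expect.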

Why Delta CDF Matters

By enabling Change Data Feed, Databricks can identify per-transaction row-level changes. The sync process uses this information to only process changed rows (inserts, deletes, updates) for vector embedding recalculation and index update. This both preserves performance and reduces OpenAI API charges, as embeddings are recomputed only when absolutely necessary.
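To make the difference concrete, here is a toy model in plain Python. It is not how Databricks implements the sync. It assumes an overwrite shows up in the change feed as a delete of every old row plus an insert of every new row, and that a MERGE only touches rows whose content differs. The row data is made up.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Change = Tuple[str, str]  # (change_type, row id)


def merge_changes(old: Dict[str, Row], new: Dict[str, Row]) -> List[Change]:
    """Row-level diff: report only ids that were added, removed, or modified."""
    changes: List[Change] = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("insert", key))
        elif key not in new:
            changes.append(("delete", key))
        elif old[key] != new[key]:
            changes.append(("update", key))
    return changes


def overwrite_changes(old: Dict[str, Row], new: Dict[str, Row]) -> List[Change]:
    """Assumed overwrite behavior: every old row deleted, every new row inserted."""
    return [("delete", k) for k in sorted(old)] + [("insert", k) for k in sorted(new)]


def rows_to_embed(changes: List[Change]) -> int:
    """Inserts and updates need an embedding call; deletes do not."""
    return sum(1 for kind, _ in changes if kind in ("insert", "update"))


if __name__ == "__main__":
    old = {"1": {"text": "a"}, "2": {"text": "b"}, "3": {"text": "c"}}
    new = {"1": {"text": "a"}, "2": {"text": "b2"}, "3": {"text": "c"}, "4": {"text": "d"}}
    print("merge:", rows_to_embed(merge_changes(old, new)))
    print("overwrite:", rows_to_embed(overwrite_changes(old, new)))
```

In this made-up example the merge path needs embeddings for the one edited row and the one new row, while the overwrite path needs them for every row in the new table.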

Recommendations

Avoid overwrite mode unless you intend to refresh the entire index and regenerate all embeddings.
Prefer MERGE/upsert operations when making incremental changes—this leverages CDF to minimize embedding recomputation.
Always enable Change Data Feed for efficient, change-aware vector index syncing.
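For the last point, a small sketch of the two SQL statements involved: one to turn on Change Data Feed and one to check what a write actually recorded. The table name and version number are hypothetical, and the helpers only build strings that you would run with spark.sql(...) or in a SQL cell.

```python
def enable_cdf_sql(table: str) -> str:
    """Statement that enables Change Data Feed on an existing Delta table."""
    return f"ALTER TABLE {table} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"


def change_summary_sql(table: str, start_version: int) -> str:
    """Count change records per type since a given table version.

    Uses the table_changes function and the _change_type column of the feed.
    """
    return (
        f"SELECT _change_type, COUNT(*) AS n "
        f"FROM table_changes('{table}', {start_version}) "
        f"GROUP BY _change_type"
    )


if __name__ == "__main__":
    # Hypothetical table name and version, for illustration only.
    print(enable_cdf_sql("main.docs.chunks"))
    print(change_summary_sql("main.docs.chunks", 12))
```

Running the summary after an overwrite and again after a MERGE on a test table is a cheap way to see how many rows each operation marks as changed before paying for embeddings.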

dlehmann · 4 weeks ago

@mark_ott How does the sync behave when I only update columns that are not used for generating the embedding (and that are not the id column)? Does the sync still process these rows and generate a new embedding? Is there a difference between having a writeback table and not having one?

jameswood32 · 4 weeks ago

From community experience, vector index sync behavior depends heavily on how the Delta table is updated. With OVERWRITE, the table is effectively replaced, so the vector index typically treats this as a full refresh. Existing embeddings are dropped and rebuilt, which can be expensive and cause temporary unavailability. In contrast, MERGE is incremental: inserts, updates, and deletes are tracked at the row level, allowing the vector index to sync only changed records. This makes MERGE far more efficient and reliable for production pipelines. Best practice is to use MERGE for ongoing updates and reserve OVERWRITE for rare, full reprocessing scenarios.

James Wood