<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/141859#M1527</link>
    <description>&lt;P&gt;From community experience, vector index sync behavior depends heavily on how the Delta table is updated. With OVERWRITE, the table is effectively replaced, so the vector index typically treats this as a full refresh. Existing embeddings are dropped and rebuilt, which can be expensive and cause temporary unavailability. In contrast, MERGE is incremental: inserts, updates, and deletes are tracked at the row level, allowing the vector index to sync only changed records. This makes MERGE far more efficient and reliable for production pipelines. Best practice is to use MERGE for ongoing updates and reserve OVERWRITE for rare, full reprocessing scenarios.&lt;/P&gt;</description>
    <pubDate>Mon, 15 Dec 2025 13:24:45 GMT</pubDate>
    <dc:creator>jameswood32</dc:creator>
    <dc:date>2025-12-15T13:24:45Z</dc:date>
    <item>
      <title>Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/113272#M801</link>
      <description>&lt;P&gt;I'm working with vector search in Databricks using vector index sync with Delta tables, and I'm a bit unclear on how updates to the source table affect the vector index, specifically when using different write operations.&lt;/P&gt;&lt;P&gt;If I &lt;STRONG&gt;overwrite&lt;/STRONG&gt; the source Delta table that is synced to the vector index (using the overwrite mode), will &lt;STRONG&gt;all the embeddings be recalculated&lt;/STRONG&gt; and the vector index fully refreshed?&lt;/P&gt;&lt;P&gt;On the other hand, if I use a &lt;STRONG&gt;MERGE operation&lt;/STRONG&gt; to upsert data into the source table, does the sync behave differently? For instance, are only the updated or inserted rows recalculated and synced?&lt;/P&gt;&lt;P&gt;Since we use Azure OpenAI's embedding models for a large number of documents, recalculating every embedding would be costly. Source Delta tables must also have Change Data Feed enabled, so I assume embedding updates could be driven by the table's change records.&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Fri, 21 Mar 2025 11:01:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/113272#M801</guid>
      <dc:creator>dfighter1312</dc:creator>
      <dc:date>2025-03-21T11:01:09Z</dc:date>
    </item>
    <item>
      <title>Re: Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/138153#M1348</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Overwriting a Delta table versus using a MERGE operation has different impacts on Databricks vector index sync, especially when Change Data Feed (CDF) is enabled and your embeddings are generated via Azure OpenAI models.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Overwrite Mode&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When you overwrite a Delta table that is synced to a vector index, the default behavior is that&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;all the rows in the table are replaced&lt;/STRONG&gt;, and therefore Databricks will&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;trigger a full recomputation of embeddings for all records&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in the vector index. This is because the overwrite operation essentially makes the previous state of the table irrelevant; the new contents become the sole source of truth. As a result, the sync process refreshes the entire vector index, recalculating embeddings for every document in the Delta table—even unchanged ones—which can be very costly if you are dealing with large datasets and expensive embedding models.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;MERGE Operation&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The MERGE operation (also known as upsert) behaves much more efficiently with vector index sync,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;especially when Change Data Feed is enabled&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;MERGE makes targeted changes—new records are inserted, existing ones updated, or deleted.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;With CDF enabled on your Delta table, Databricks can track exactly&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;which rows were inserted, updated, or deleted&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The vector index sync process only recalculates embeddings for&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;those specific changed rows&lt;/STRONG&gt;.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Unchanged rows will not have their embeddings recomputed, which minimizes unnecessary calls to the embedding API and controls costs.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;This approach is both cost- and performance-optimal for large-scale applications where document updates are regular and only a subset of the corpus changes between syncs.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Why Delta CDF Matters&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;By enabling Change Data Feed, Databricks can identify per-transaction row-level changes. The sync process uses this information to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;only process changed rows&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(inserts, deletes, updates) for vector embedding recalculation and index update. This both preserves performance and reduces OpenAI API charges, as embeddings are recomputed only when absolutely necessary.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Avoid overwrite mode&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;unless you intend to refresh the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;entire&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;index and regenerate all embeddings.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Prefer MERGE/upsert operations&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;when making incremental changes—this leverages CDF to minimize embedding recomputation.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Always enable Change Data Feed&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for efficient, change-aware vector index syncing.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 07 Nov 2025 16:51:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/138153#M1348</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-07T16:51:31Z</dc:date>
    </item>
    <item>
      <title>Re: Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/141852#M1526</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/82205"&gt;@mark_ott&lt;/a&gt;&amp;nbsp;How does the sync behave when I only update columns that are not used for generating the embedding (and are not the id column)? Does the sync still process these rows and generate new embeddings? Is there a difference between having a writeback table and not having one?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2025 12:13:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/141852#M1526</guid>
      <dc:creator>dlehmann</dc:creator>
      <dc:date>2025-12-15T12:13:55Z</dc:date>
    </item>
    <item>
      <title>Re: Behavior of Vector Index Sync with Delta Tables When Using OVERWRITE vs MERGE in Databricks</title>
      <link>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/141859#M1527</link>
      <description>&lt;P&gt;From community experience, vector index sync behavior depends heavily on how the Delta table is updated. With OVERWRITE, the table is effectively replaced, so the vector index typically treats this as a full refresh. Existing embeddings are dropped and rebuilt, which can be expensive and cause temporary unavailability. In contrast, MERGE is incremental: inserts, updates, and deletes are tracked at the row level, allowing the vector index to sync only changed records. This makes MERGE far more efficient and reliable for production pipelines. Best practice is to use MERGE for ongoing updates and reserve OVERWRITE for rare, full reprocessing scenarios.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Dec 2025 13:24:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/behavior-of-vector-index-sync-with-delta-tables-when-using/m-p/141859#M1527</guid>
      <dc:creator>jameswood32</dc:creator>
      <dc:date>2025-12-15T13:24:45Z</dc:date>
    </item>
  </channel>
</rss>

