
Delta Sync Index - Slow Sync Performance from Delta Table to Mosaic AI Vector Search

amitkumarvish
New Contributor II

I'm currently working with Mosaic AI Vector Search on Databricks and using a Delta Sync index (TRIGGERED) to sync from a Delta table (with embedding column already precomputed) to a vector index.

However, I'm noticing the sync process is quite slow: roughly 10 minutes to sync around 100 chunks.

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED"
)

Are there any best practices, configuration tweaks, or recommendations to improve ingestion speed?

1 ACCEPTED SOLUTION

Accepted Solutions

Vinay_M_R
Databricks Employee

Hello @amitkumarvish ,

I wish you a wonderful day ahead!

This could be due to a dimension mismatch between the precomputed embeddings and what the index expects, or to unnecessary metadata columns or binary data stored in metadata fields, which can create serialization/deserialization bottlenecks.

To optimize ingestion speed for your Mosaic AI Vector Search Delta Sync index in TRIGGERED mode, you can consider these recommendations from Databricks documentation and best practices:

  • Ensure adequate parallelization by verifying your source Delta table is properly partitioned.
  • Remove unnecessary metadata columns from the sync configuration. Don't store binary formats such as images as metadata, as this adversely affects latency; instead, store the file path as metadata.
  • Verify the embedding column uses an efficient data type (e.g., array<float>).
  • Confirm the precomputed embeddings match the embedding model's expected dimensions.
  • If you want to use cosine similarity, normalize your datapoint embeddings before feeding them into vector search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity (see the sketch after this list).
  • Verify you're using the latest Python SDK (databricks-vectorsearch).
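
As a quick illustration of the dimension-check and normalization points above, here is a minimal sketch using NumPy. The array, dimension, and shapes are hypothetical placeholders, not taken from your table:

import numpy as np

# Hypothetical batch of precomputed embeddings (replace with vectors from your Delta table).
embeddings = np.random.rand(100, 1024).astype(np.float32)

# 1. Confirm every vector matches the dimension you will pass as embedding_dimension.
expected_dim = 1024
assert embeddings.shape[1] == expected_dim, (
    f"Embedding dimension {embeddings.shape[1]} does not match index dimension {expected_dim}"
)

# 2. L2-normalize so that L2 distance ranks results the same way cosine similarity would.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / np.clip(norms, a_min=1e-12, a_max=None)

# After normalization, every vector has unit length.
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))  # True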

If you are creating a Delta Sync Index with self-managed embeddings, please use the configuration below:

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=INDEX_NAME,
    source_table_name=FULLY_QUALIFIED_TABLE_NAME,
    primary_key="chunk_id",
    embedding_vector_column="embedding",  # Add this line: column holding the precomputed vectors
    embedding_dimension=1024,             # Add this line: must match the embedding model's output size
    pipeline_type="TRIGGERED"
)
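
One follow-up note: with pipeline_type="TRIGGERED", the index only ingests new rows when a sync is requested. Below is a minimal sketch of triggering and monitoring a sync with the Python SDK; it assumes the get_index/sync/describe methods of databricks-vectorsearch (behavior can vary by SDK version), and the endpoint/index names are placeholders:

from databricks.vector_search.client import VectorSearchClient

VECTOR_SEARCH_ENDPOINT_NAME = "my_endpoint"        # placeholder
INDEX_NAME = "catalog.schema.my_index"             # placeholder

vsc = VectorSearchClient()

# Look up the index created above.
index = vsc.get_index(endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME, index_name=INDEX_NAME)

# TRIGGERED pipelines only pick up new rows from the Delta table when a sync is requested.
index.sync()

# Inspect the index status (pipeline state, indexed row count) to monitor progress.
print(index.describe().get("status"))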

For more best practices and recommendations to improve ingestion speed, I am sharing the references below:

https://docs.databricks.com/aws/en/generative-ai/vector-search-best-practices
https://www.perplexity.ai/search/i-m-currently-working-with-mos-ZVxiKktZTkCngS2EhAjUsA
https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search

 

 


