
Delta Sync Index - Slow Sync Performance from Delta Table to Mosaic AI Vector Search

amitkumarvish
New Contributor II

I'm currently working with Mosaic AI Vector Search on Databricks and using a Delta Sync index (TRIGGERED) to sync from a Delta table (with embedding column already precomputed) to a vector index.

However, I'm noticing the sync process is quite slow: roughly 10 minutes to sync around 100 chunks.

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED"
)

Are there any best practices, configuration tweaks, or recommendations to improve ingestion speed?

1 ACCEPTED SOLUTION

Accepted Solutions

Vinay_M_R
Databricks Employee

Hello @amitkumarvish ,

I wish you a wonderful day ahead!

This could be due to a dimension mismatch between the precomputed embeddings and what the index expects, or to unnecessary metadata columns or binary data stored in metadata fields, which can create serialization/deserialization bottlenecks.

To optimize ingestion speed for your Mosaic AI Vector Search Delta Sync index in TRIGGERED mode, you can consider these recommendations from Databricks documentation and best practices:

  • Ensure adequate parallelization by verifying your source Delta table is properly partitioned.
  • Remove unnecessary metadata columns from the sync configuration. Don't store binary formats such as images as metadata, as this adversely affects latency; instead, store the file path as metadata.
  • Verify the embedding column uses an efficient data type (e.g., array<float>).
  • Confirm the precomputed embeddings match the embedding model's expected dimensions.
  • If you want to use cosine similarity, normalize your datapoint embeddings before feeding them into vector search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity (see the sketch after this list).
  • Verify you're using the latest Python SDK (databricks-vectorsearch).
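
As a quick illustration of the dimension-check and normalization points above, here is a minimal sketch using NumPy. The array, dimension, and shapes are hypothetical placeholders, not taken from your table:

import numpy as np

# Hypothetical batch of precomputed embeddings (replace with vectors from your Delta table).
embeddings = np.random.rand(100, 1024).astype(np.float32)

# 1. Confirm every vector matches the dimension you will pass as embedding_dimension.
expected_dim = 1024
assert embeddings.shape[1] == expected_dim, (
    f"Embedding dimension {embeddings.shape[1]} does not match index dimension {expected_dim}"
)

# 2. L2-normalize so that L2 distance ranks results the same way cosine similarity would.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / np.clip(norms, a_min=1e-12, a_max=None)

# After normalization, every vector has unit length.
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))  # True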

If you are creating a Delta Sync Index with self-managed embeddings, please use the configuration below:

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=INDEX_NAME,
    source_table_name=FULLY_QUALIFIED_TABLE_NAME,
    primary_key="chunk_id",
    embedding_vector_column="embedding",  # Add this line: column holding the precomputed vectors
    embedding_dimension=1024,             # Add this line: must match the embedding model's output size
    pipeline_type="TRIGGERED"
)
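
One follow-up note: with pipeline_type="TRIGGERED", the index only ingests new rows when a sync is requested. Below is a minimal sketch of triggering and monitoring a sync with the Python SDK; it assumes the get_index/sync/describe methods of databricks-vectorsearch (behavior can vary by SDK version), and the endpoint/index names are placeholders:

from databricks.vector_search.client import VectorSearchClient

VECTOR_SEARCH_ENDPOINT_NAME = "my_endpoint"        # placeholder
INDEX_NAME = "catalog.schema.my_index"             # placeholder

vsc = VectorSearchClient()

# Look up the index created above.
index = vsc.get_index(endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME, index_name=INDEX_NAME)

# TRIGGERED pipelines only pick up new rows from the Delta table when a sync is requested.
index.sync()

# Inspect the index status (pipeline state, indexed row count) to monitor progress.
print(index.describe().get("status"))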

For more best practices and recommendations to improve ingestion speed, I am sharing the references below:

https://docs.databricks.com/aws/en/generative-ai/vector-search-best-practices
https://www.perplexity.ai/search/i-m-currently-working-with-mos-ZVxiKktZTkCngS2EhAjUsA
https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search

 

 


