Hello @amitkumarvish ,
I wish you a wonderful day ahead!
This could be caused by a dimension mismatch between your precomputed embeddings and what the index expects, or by storing unnecessary metadata columns or binary data in metadata fields, which can create serialization/deserialization bottlenecks.
To optimize ingestion speed for your Mosaic AI Vector Search Delta Sync index in TRIGGERED mode, consider these recommendations from the Databricks documentation and best practices:
- Ensure adequate parallelization by verifying that your source Delta table is properly partitioned.
- Remove unnecessary metadata columns from the sync configuration. Don't store binary formats such as images as metadata, as this adversely affects latency; instead, store the file path as metadata.
- Verify that the embedding column uses an efficient data type.
- Confirm that your precomputed embeddings match the model's expected dimensions.
- If you want to use cosine similarity, normalize your embeddings before feeding them into vector search. When the data points are normalized, the ranking produced by L2 distance is the same as the ranking produced by cosine similarity.
- Verify that you're using the latest version of the Python SDK.
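The dimension check and normalization steps above can be sketched as follows. This is a minimal illustration using NumPy; the `embeddings` array and `EXPECTED_DIM` value are placeholders for your own precomputed embeddings and the dimension you configured on the index.

```python
import numpy as np

# Placeholder for your precomputed embeddings, shape (num_rows, dim).
embeddings = np.random.rand(100, 1024).astype(np.float32)

# Sanity-check the dimension against what the index expects; a mismatch
# here will cause ingestion failures or degraded sync performance.
EXPECTED_DIM = 1024  # must match embedding_dimension in the index config
assert embeddings.shape[1] == EXPECTED_DIM, "dimension mismatch with index"

# L2-normalize each row before writing to the source Delta table, so that
# L2-distance ranking matches cosine-similarity ranking at query time.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
```

After normalization, every row has unit length, so the index's L2 metric yields the same ordering as cosine similarity.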
If you are creating a Delta Sync index with self-managed embeddings, apply the configuration tweaks below:
```python
vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=INDEX_NAME,
    source_table_name=FULLY_QUALIFIED_TABLE_NAME,
    primary_key="chunk_id",
    embedding_vector_column="embedding",  # Add this line
    embedding_dimension=1024,             # Add this line
    pipeline_type="TRIGGERED",
)
```
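Since the index uses TRIGGERED mode, syncs only run when you start them. A minimal sketch of triggering a sync and checking its status with the `databricks-vectorsearch` SDK is below; it assumes a configured workspace and the same `vsc` client, endpoint, and index names as above, so it is not runnable outside a Databricks environment.

```python
from databricks.vector_search.client import VectorSearchClient

# Reuse the client that created the index (credentials come from the
# workspace environment or explicit arguments).
vsc = VectorSearchClient()

# Look up the existing Delta Sync index on its endpoint.
index = vsc.get_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=INDEX_NAME,
)

# In TRIGGERED mode, ingestion only happens when you call sync().
index.sync()

# Inspect the index status (e.g. indexed row counts) after the sync.
print(index.describe())
```

Batching your source-table updates and triggering a single sync afterwards is generally cheaper than syncing after every small write.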
For more best practices and recommendations on improving ingestion speed, see the official documentation:
https://docs.databricks.com/aws/en/generative-ai/vector-search-best-practices
https://learn.microsoft.com/en-us/azure/databricks/generative-ai/vector-search