Vector search index creation is incredibly slow
03-05-2025 07:08 AM - edited 03-05-2025 07:15 AM
I am trying to create a vector search index for a Delta table using Azure OpenAI embeddings (text-embedding-3-large). The table contains 5,000 chunks of approx. 1,000 tokens each. The embeddings are generated through a Databricks model serving endpoint that forwards the embedding requests to our Azure deployment.
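For context, the embedding endpoint is set up as a Databricks external-model serving endpoint that proxies to the Azure OpenAI deployment. The setup looks roughly like this (illustrative sketch; the resource URL, deployment name, API version, and secret scope below are placeholders, not our real values):

from mlflow.deployments import get_deploy_client

deploy_client = get_deploy_client("databricks")

# External-model endpoint that forwards embedding requests to Azure OpenAI.
# All Azure-specific values below are placeholders.
deploy_client.create_endpoint(
    name="text-embedding-3-large",  # endpoint name referenced when creating the index
    config={
        "served_entities": [
            {
                "external_model": {
                    "name": "text-embedding-3-large",
                    "provider": "openai",
                    "task": "llm/v1/embeddings",
                    "openai_config": {
                        "openai_api_type": "azure",
                        "openai_api_base": "https://<resource>.openai.azure.com/",
                        "openai_deployment_name": "<azure-deployment-name>",
                        "openai_api_version": "2024-02-01",
                        "openai_api_key": "{{secrets/<scope>/<key>}}",
                    },
                }
            }
        ]
    },
)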
The latency of the index creation is incredibly high: the initial sync takes over an hour just to embed 5,000 chunks. Extrapolated to 5 million chunks, embedding them would take over a month.
The deployment in Azure can process 350K tokens/min and should not be the limiting factor.
I am currently creating the index like this:
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="text-embedding-3-large",
)
What options do I have for speeding up the index creation?
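One alternative I am wondering about is precomputing the embeddings myself in a batch job against the Azure deployment and letting the sync ingest a self-managed embeddings column instead, roughly like this (untested sketch; embedding_dimension=3072 assumes text-embedding-3-large, and the "embedding" column name is illustrative):

# Self-managed embeddings variant: vectors are computed upstream into an
# "embedding" column of the source table, and the sync only ingests them.
vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=3072,  # output size of text-embedding-3-large
    embedding_vector_column="embedding",
)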

