Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Vector search index creation is incredibly slow

epistoteles
New Contributor

I am trying to create a vector search index for a Delta Table using Azure OpenAI embeddings (text-embedding-3-large). The table contains 5000 chunks with approx. 1000 tokens each. The OpenAI embeddings are generated through a Databricks model serving endpoint which forwards the embedding requests to our Azure deployment.

Index creation is extremely slow: the initial sync takes over an hour to embed just 5,000 chunks. Extrapolating to 5 million chunks, embedding alone would take more than a month.

The deployment in Azure can process 350K tokens/min, so it should not be the limiting factor: 5,000 chunks × ~1,000 tokens ≈ 5M tokens, which at full throughput is roughly 15 minutes of embedding work.


I am currently creating the index like this:

 

vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="text-embedding-3-large",
)
 
My suspicion is that the index computes embeddings row by row, waiting for each embedding request to return before sending the next, rather than issuing batched or concurrent calls to Azure.

Which options do I have for speeding up the index creation?
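One workaround I have been considering is to pre-compute the embeddings myself with concurrent batched calls, store them in an ARRAY&lt;FLOAT&gt; column on the source table, and then create the index with self-managed embeddings (`embedding_vector_column` instead of `embedding_source_column`), so the sync never has to call the model endpoint at all. A rough sketch, where `embed_batch` is a placeholder for whatever function sends one batch of texts to the Azure OpenAI deployment:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(texts, embed_batch, batch_size=16, max_workers=8):
    """Embed texts with concurrent batched calls.

    `embed_batch` maps a list of strings to a list of vectors
    (e.g. one Azure OpenAI embeddings API call per batch).
    Order of the returned vectors matches the input order.
    """
    batches = chunked(texts, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(embed_batch, batches))
    return [vec for batch in results for vec in batch]

# With the vectors written back to the Delta table, the index would be
# created with self-managed embeddings instead (dimension 3072 for
# text-embedding-3-large):
#
# vsc.create_delta_sync_index(
#     endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
#     index_name=vs_index_fullname,
#     source_table_name=source_table_fullname,
#     pipeline_type="TRIGGERED",
#     primary_key="id",
#     embedding_dimension=3072,
#     embedding_vector_column="embedding",
# )
```

But I would prefer to keep the managed-embeddings setup if the sync speed can be fixed.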
 

 
