Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Generate embeddings for 50 million rows in a DataFrame

vikram_p
New Contributor

Hello All,

I have a DataFrame with 5 million rows, and before we can set up a Vector Search endpoint against an index, we want to generate an embeddings column for each of those rows. Could you please suggest an optimal way to do this?

We are in the development phase, so we need to do a full load now; later we will need to do the same for incremental loads.

Thanks & Regards,

Vikram

1 REPLY 1

bianca_unifeye
New Contributor II

The easiest and most reliable way to generate embeddings for millions of rows is to let Databricks Vector Search compute them automatically during synchronization from a Delta table.
Vector Search can generate the embeddings for you, keep them updated when new records are inserted or updated, and handle batching, scaling, and retries behind the scenes.

You don't have to manually loop over rows or call a model serving endpoint yourself; Vector Search handles that for you. See the sketch after the list below.

https://learn.microsoft.com/en-us/azure/databricks/generative-ai/create-query-vector-search

  • Handles full backfill (5M+ rows) efficiently

  • Supports incremental updates automatically via Delta change data

  • No manual code or loops required

  • Fully managed and Unity Catalog-governed
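
A minimal sketch with the databricks-vectorsearch Python SDK, assuming a Unity Catalog source Delta table catalog.schema.docs with an id primary key and a text column, and a Databricks-hosted embedding model endpoint (databricks-gte-large-en is used here as an example); all names are placeholders for your own:

from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# One-time: create a Vector Search endpoint to host the index.
client.create_endpoint(name="my_vs_endpoint", endpoint_type="STANDARD")

# Delta Sync index with Databricks-managed embeddings: Vector Search reads the
# text column from the source Delta table, calls the embedding model endpoint
# for the full backfill, and keeps the index in sync afterwards.
index = client.create_delta_sync_index(
    endpoint_name="my_vs_endpoint",
    index_name="catalog.schema.docs_index",
    source_table_name="catalog.schema.docs",   # the 5M-row Delta table
    pipeline_type="TRIGGERED",                 # or "CONTINUOUS" for near-real-time sync
    primary_key="id",
    embedding_source_column="text",            # column to generate embeddings from
    embedding_model_endpoint_name="databricks-gte-large-en",
)

# For incremental loads with a TRIGGERED pipeline, sync after new data lands:
index.sync()

Note that the source table must have change data feed enabled (ALTER TABLE catalog.schema.docs SET TBLPROPERTIES (delta.enableChangeDataFeed = true)) so the index can pick up inserts and updates incrementally.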

 
