
Dbdemo: LLM Chatbot With Retrieval Augmented Generation (RAG)

cmunteanu
New Contributor III

Hello All,

I am trying to follow the dbdemo called 'llm-rag-chatbot', available at the following link. The setup works OK, and I have imported an embedding model from the Databricks Marketplace:

  • bge_large_en_v1_5

Running the notebook called 01-Data-Preparation-and-Index, I am stuck on an error when trying to create a Vector Search Index with Managed Embeddings and the BGE model that I have previously set up as a serving endpoint. More specifically, the Vector Search endpoint provisions successfully, but when executing the index creation and synchronization method, create_delta_sync_index, I get the following error:

----
Exception: Response content b'{"error_code":"INVALID_PARAMETER_VALUE","message":"Model serving endpoint bge-large-en configured with improper input: {\\"error_code\\": \\"BAD_REQUEST\\", \\"message\\": \\"Failed to enforce schema of data \' 0\\\\n0 Welcome to databricks vector search\' with schema \'[\'input\': string (required)]\'. Error: Model is missing inputs [\'input\']. Note that there were extra inputs: [0]\\"}"}', status_code 400
----
 
My code that calls this method is:
if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
  print(f"Creating index {vs_index_fullname} on endpoint {VECTOR_SEARCH_ENDPOINT_NAME}...")
  vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column='content', # the column containing our text
    embedding_model_endpoint_name='bge-large-en'
    #embedding_model_endpoint_name='gte_large'
  )
I have tried changing to a different embedding model (gte_large), but I am still getting the above error.
I guess there is an incompatibility between the input schema of the embedding model and the schema expected by the Vector Search endpoint.
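A minimal diagnostic sketch (not part of the original post, assuming mlflow is available on the cluster): the error says the model's signature requires a column named 'input', so querying the endpoint directly with that shape should succeed, which would confirm that the mismatch is in the payload Vector Search sends rather than in the endpoint itself.

# Hedged diagnostic: call the serving endpoint directly with an 'input' key,
# the column name the error message says the model's signature requires.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")  # uses the workspace auth context

response = client.predict(
    endpoint="bge-large-en",
    inputs={"input": ["Welcome to Databricks Vector Search"]},
)
print(response)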
 
Has any of you encountered this problem? I would appreciate it if you could give me a hint on how to solve it using an embedding model from the Databricks Marketplace.
 
Thanks!

3 REPLIES

Kaniz
Community Manager

Hi @cmunteanu

  • Ensure that the input schema of your embedding model aligns with what the Vector Search endpoint expects. The ‘input’ column is crucial for the model to process text data correctly.
  • Verify the input requirements of the bge_large_en_v1_5 model. It should expect a column named ‘input’ containing text data.
  • Confirm that your data pipeline provides the necessary input format to the model.
  • When creating the Vector Search Index, ensure that you specify the correct parameters:
    • embedding_source_column: This should match the column name containing your text data (e.g., ‘content’).
    • embedding_model_endpoint_name: Use ‘bge-large-en’ as you’ve set up this model as a serving endpoint.
  • If you’ve made changes to the model or schema, consider reindexing the Vector Search.
  • Note that the embedding model cannot be modified after indexing, so any changes require reindexing.
  • If you continue to face issues, consider using multimodal embeddings (such as CLIP) that can handle both text and images.
  • Retrieve using similarity search and link to images in a docstore.
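As a hedged aside (not from the original reply): if the Marketplace endpoint's signature cannot be made to match, one variant worth trying, assuming the pay-per-token Foundation Model endpoint databricks-bge-large-en is available in the workspace, is to point the index at that Databricks-hosted endpoint, whose serving schema is compatible with managed embeddings:

# Hypothetical variant (assumes Foundation Model APIs are enabled in this
# workspace/region); everything else matches the demo notebook's call.
vsc.create_delta_sync_index(
  endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
  index_name=vs_index_fullname,
  source_table_name=source_table_fullname,
  pipeline_type="TRIGGERED",
  primary_key="id",
  embedding_source_column='content',
  embedding_model_endpoint_name='databricks-bge-large-en'
)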

cmunteanu
New Contributor III

Hello @Kaniz, thanks a lot for the information you provided. Anyway, I have managed a workaround by pre-computing the embeddings for each chunk. I have created an embedding column on the source table and used this column as input to the create_delta_sync_index method.

That is, substitute the parameter embedding_source_column='content' with:
embedding_dimension=1024,
embedding_vector_column="embedding"
and the synchronization of the index with the source table worked just fine.
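Put together, a minimal sketch of the full call with those substitutions (variable and helper names as in the demo notebook; the 'embedding' column is assumed to hold the 1024-dimensional float arrays precomputed with BGE):

# Self-managed embeddings: the index syncs precomputed vectors, so no
# embedding model endpoint is involved at indexing time.
if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
  vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=1024,            # BGE large output size
    embedding_vector_column="embedding"  # precomputed vectors column
  )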
 

jbellidocaceres
New Contributor II
Hi @Kaniz and @cmunteanu, I am having exactly the same problem creating the vector index, and it seems that there could be a bug in the demo. What confuses me is that even when using the Databricks UI, I cannot manage to create the vector index.
 
Well, when running the demo, it stays for a long time repeating:
============
Waiting for index to be ready, this can take a few min... {'detailed_state': 'PROVISIONING_INITIAL_SNAPSHOT', 'message': 'Index is currently is in the process of syncing initial data. Check latest status: https://adb-393322312342211.5.azuredatabricks.net/explore/data/dev_talk/llm_rag/databricks_documenta...', 'indexed_row_count': 0, 'provisioning_status': {'initial_pipeline_sync_progress': {'latest_version_currently_processing': 1, 'num_synced_rows': 0, 'total_rows_to_sync': 14129, 'sync_progress_completion': 0.0, 'pipeline_metrics': {'total_sync_time_per_row_ms': 0.0, 'ingestion_metrics': {'ingestion_time_per_row_ms': 0.0, 'ingestion_batch_size': 300}, 'embedding_metrics': {'embedding_generation_time_per_row_ms': 0.0, 'embedding_generation_batch_size': 0}}}}, 'ready': False, 'index_url': 'adb-393322312342211.5.azuredatabricks.net/api/2.0/vector-search/endpoints/dbdemos_vs_endpoint/indexes/dev_talk.llm_rag.databricks_documentation_vs_index'} - pipeline url:adb-393322312342211.5.azuredatabricks.net/api/2.0/vector-search/endpoints/dbdemos_vs_endpoint/indexes/dev_talk.llm_rag.databricks_documentation_vs_index

Then after a long time the cell stops with an error message (screenshot not included here).

It seems that the URL is wrong (this is the bug I was referring to): it has the endpoint and the vector index path interchanged. It should look like the URL in the output of the cell shown above, where it is displayed correctly.
================
 @Kaniz If any specific configuration is required regarding the embedding model, it would be good to have it specified. In your reply you said:
  • When creating the Vector Search Index, ensure that you specify the correct parameters:
    • embedding_source_column: This should match the column name containing your text data (e.g., ‘content’).
    • embedding_model_endpoint_name: Use ‘bge-large-en’ as you’ve set up this model as a serving endpoint.
All these specifications are correctly configured in the demo notebook. So, I am confused about what is left for us to configure.
 @cmunteanu I have followed your suggestion of using self-managed embeddings to create the vector index. It does work, in the sense that I created the vector index. But I cannot (easily) use the nice features of the Databricks Vector Search client that internally converts text to vectors and vice versa, which makes things easier for the RAG chatbot. Have you gotten around that?
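One hedged sketch of a possible way around that (an assumption, not confirmed in the thread): with self-managed embeddings, embed the query text client-side against the same serving endpoint and call similarity_search with query_vector instead of query_text. The response parsing below assumes an OpenAI-style embeddings payload; adjust it to the actual signature of your endpoint.

# Hypothetical query-side helper for a self-managed-embeddings index.
from mlflow.deployments import get_deploy_client

deploy_client = get_deploy_client("databricks")

def embed_query(text):
    # Assumption: the endpoint accepts {'input': [...]} and returns an
    # OpenAI-style {'data': [{'embedding': [...]}]} payload.
    response = deploy_client.predict(endpoint="bge-large-en", inputs={"input": [text]})
    return response["data"][0]["embedding"]

index = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
results = index.similarity_search(
    query_vector=embed_query("How do I create a Vector Search index?"),
    columns=["id", "content"],  # source-table columns to return
    num_results=3,
)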