
Dbdemo: LLM Chatbot With Retrieval Augmented Generation (RAG)

cmunteanu
New Contributor III

Hello All,

I am trying to follow the dbdemo called 'llm-rag-chatbot', available at the following link. The setup works OK, and I have imported an embedding model from the Databricks Marketplace:

  • bge_large_en_v1_5

Running the notebook called 01-Data-Preparation-and-Index, I am stuck on an error when trying to create a Vector Search Index with Managed Embeddings and the BGE model that I have previously set up as a serving endpoint. More specifically, the Vector Search endpoint provisions successfully, but when executing the index creation and synchronization method, create_delta_sync_index, I get the following error:

----
Exception: Response content b'{"error_code":"INVALID_PARAMETER_VALUE","message":"Model serving endpoint bge-large-en configured with improper input: {\\"error_code\\": \\"BAD_REQUEST\\", \\"message\\": \\"Failed to enforce schema of data \' 0\\\\n0 Welcome to databricks vector search\' with schema \'[\'input\': string (required)]\'. Error: Model is missing inputs [\'input\']. Note that there were extra inputs: [0]\\"}"}', status_code 400
----
 
My code that calls this method is:
if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
  print(f"Creating index {vs_index_fullname} on endpoint {VECTOR_SEARCH_ENDPOINT_NAME}...")
  vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column='content', # the column containing our text
    embedding_model_endpoint_name='bge-large-en'
    #embedding_model_endpoint_name='gte_large'
  )
I have tried changing to a different embedding model (gte_large), but I am still getting the above error.
I guess there is an incompatibility between the input schema of the embedding model and the schema expected by the Vector Search endpoint.
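A minimal diagnostic sketch (not part of the original post, assuming mlflow is available on the cluster): the error says the model's signature requires a column named 'input', so querying the endpoint directly with that shape should succeed, which would confirm that the mismatch is in the payload Vector Search sends rather than in the endpoint itself.

# Hedged diagnostic: call the serving endpoint directly with an 'input' key,
# the column name the error message says the model's signature requires.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")  # uses the workspace auth context

response = client.predict(
    endpoint="bge-large-en",
    inputs={"input": ["Welcome to Databricks Vector Search"]},
)
print(response)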
 
Has any of you encountered this problem? I would appreciate it if you could give me a hint on how to solve it using an embedding model from the Databricks Marketplace.
 
Thanks!

3 REPLIES

Kaniz
Community Manager

Hi @cmunteanu

  • Ensure that the input schema of your embedding model aligns with what the Vector Search endpoint expects. The ‘input’ column is crucial for the model to process text data correctly.
  • Verify the input requirements of the bge_large_en_v1_5 model. It should expect a column named ‘input’ containing text data.
  • Confirm that your data pipeline provides the necessary input format to the model.
  • When creating the Vector Search Index, ensure that you specify the correct parameters:
    • embedding_source_column: This should match the column name containing your text data (e.g., ‘content’).
    • embedding_model_endpoint_name: Use ‘bge-large-en’ as you’ve set up this model as a serving endpoint.
  • If you’ve made changes to the model or schema, consider reindexing the Vector Search.
  • Note that the embedding model cannot be modified after indexing, so any changes require reindexing.
  • If you continue to face issues, consider using multimodal embeddings (such as CLIP) that can handle both text and images.
  • Retrieve using similarity search and link to images in a docstore.
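As a hedged aside (not from the original reply): if the Marketplace endpoint's signature cannot be made to match, one variant worth trying, assuming the pay-per-token Foundation Model endpoint databricks-bge-large-en is available in the workspace, is to point the index at that Databricks-hosted endpoint, whose serving schema is compatible with managed embeddings:

# Hypothetical variant (assumes Foundation Model APIs are enabled in this
# workspace/region); everything else matches the demo notebook's call.
vsc.create_delta_sync_index(
  endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
  index_name=vs_index_fullname,
  source_table_name=source_table_fullname,
  pipeline_type="TRIGGERED",
  primary_key="id",
  embedding_source_column='content',
  embedding_model_endpoint_name='databricks-bge-large-en'
)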

cmunteanu
New Contributor III

Hello @Kaniz, thanks a lot for the information you provided. Anyway, I have managed a workaround by pre-computing the embeddings for each chunk. I have created an embedding column on the source table and used this column as input to the create_delta_sync_index method.

That is, substitute the parameter embedding_source_column='content' with:
embedding_dimension=1024,
embedding_vector_column="embedding"
and the synchronization of the index with the source table worked just fine.
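Put together, a minimal sketch of the full call with those substitutions (variable and helper names as in the demo notebook; the 'embedding' column is assumed to hold the 1024-dimensional float arrays precomputed with BGE):

# Self-managed embeddings: the index syncs precomputed vectors, so no
# embedding model endpoint is involved at indexing time.
if not index_exists(vsc, VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname):
  vsc.create_delta_sync_index(
    endpoint_name=VECTOR_SEARCH_ENDPOINT_NAME,
    index_name=vs_index_fullname,
    source_table_name=source_table_fullname,
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_dimension=1024,            # BGE large output size
    embedding_vector_column="embedding"  # precomputed vectors column
  )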
 

jbellidocaceres
New Contributor II
Hi @Kaniz and @cmunteanu, I am having exactly the same problem creating the vector index, and it seems that there could be a bug in the demo. What confuses me is that even when using the Databricks UI, I cannot manage to create the vector index.
 
Well, when running the demo, it stays for a long time repeating:
============
Waiting for index to be ready, this can take a few min... {'detailed_state': 'PROVISIONING_INITIAL_SNAPSHOT', 'message': 'Index is currently is in the process of syncing initial data. Check latest status: https://adb-393322312342211.5.azuredatabricks.net/explore/data/dev_talk/llm_rag/databricks_documenta...', 'indexed_row_count': 0, 'provisioning_status': {'initial_pipeline_sync_progress': {'latest_version_currently_processing': 1, 'num_synced_rows': 0, 'total_rows_to_sync': 14129, 'sync_progress_completion': 0.0, 'pipeline_metrics': {'total_sync_time_per_row_ms': 0.0, 'ingestion_metrics': {'ingestion_time_per_row_ms': 0.0, 'ingestion_batch_size': 300}, 'embedding_metrics': {'embedding_generation_time_per_row_ms': 0.0, 'embedding_generation_batch_size': 0}}}}, 'ready': False, 'index_url': 'adb-393322312342211.5.azuredatabricks.net/api/2.0/vector-search/endpoints/dbdemos_vs_endpoint/indexes/dev_talk.llm_rag.databricks_documentation_vs_index'} - pipeline url:adb-393322312342211.5.azuredatabricks.net/api/2.0/vector-search/endpoints/dbdemos_vs_endpoint/indexes/dev_talk.llm_rag.databricks_documentation_vs_index

Then after a long time the cell stops with an error message (screenshot not included here).

It seems that the URL is wrong (this is the bug I was referring to): it has the endpoint and the vector index path interchanged. It should look like the URL in the output of the cell shown above, where it is displayed correctly.
================
 @Kaniz If any specific configuration is required regarding the embedding model, it would be good to have it specified. In your reply you said:
  • When creating the Vector Search Index, ensure that you specify the correct parameters:
    • embedding_source_column: This should match the column name containing your text data (e.g., ‘content’).
    • embedding_model_endpoint_name: Use ‘bge-large-en’ as you’ve set up this model as a serving endpoint.
All these specifications are correctly configured in the demo notebook. So, I am confused about what is left for us to configure.
 @cmunteanu I have followed your suggestion of using self-managed embeddings to create the vector index. It does work, in the sense that I created the vector index. But I cannot (easily) use the nice features of the Databricks Vector Search client that internally converts text to vectors and vice versa, which makes things easier for the RAG chatbot. Have you gotten around that?
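One hedged sketch of a possible way around that (an assumption, not confirmed in the thread): with self-managed embeddings, embed the query text client-side against the same serving endpoint and call similarity_search with query_vector instead of query_text. The response parsing below assumes an OpenAI-style embeddings payload; adjust it to the actual signature of your endpoint.

# Hypothetical query-side helper for a self-managed-embeddings index.
from mlflow.deployments import get_deploy_client

deploy_client = get_deploy_client("databricks")

def embed_query(text):
    # Assumption: the endpoint accepts {'input': [...]} and returns an
    # OpenAI-style {'data': [{'embedding': [...]}]} payload.
    response = deploy_client.predict(endpoint="bge-large-en", inputs={"input": [text]})
    return response["data"][0]["embedding"]

index = vsc.get_index(VECTOR_SEARCH_ENDPOINT_NAME, vs_index_fullname)
results = index.similarity_search(
    query_vector=embed_query("How do I create a Vector Search index?"),
    columns=["id", "content"],  # source-table columns to return
    num_results=3,
)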