Databricks Community

Paolo_Finardi · 02-03-2025

Hey everyone,

I'm working on a project to enhance our data analysts' experience on the platform. In our company, we've done a great job with Unity Catalog, and we have comprehensive descriptions for all the tables and columns in our data lake.

I'm interested in creating a chatbot assistant to help our data analysts navigate and understand the data more efficiently. The idea is to have an assistant that knows the entire data dictionary and can answer queries about the data without requiring analysts to manually explore documentation or metadata.

I've tried using the Genie Space feature, but from what I understand, it requires selecting a few tables at a time. I'm looking for a solution where the assistant has access to the entire Unity Catalog metadata, providing a more seamless and holistic experience for the analysts.

Is there any out-of-the-box solution from Databricks that could support this? Or has anyone implemented something similar and can share insights on how to approach this? Any advice or suggestions would be greatly appreciated!

Thanks in advance for your help!

MariuszK · 02-03-2025

Hi,

I have a similar idea: creating a virtual assistant that can answer questions about metadata.

Databricks doesn't have a tool like this, but my plan is to use RAG. You can read the metadata from UC, load it into a vector index, and create an agent that answers questions against it. I have tested a RAG application on Databricks and it works perfectly fine; now I'm planning to test it with metadata.
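
To make the first step of that plan concrete, here is a minimal sketch of turning column metadata into one text document per table, ready to be embedded. It is only an illustration under assumptions: the sample rows are made up, shaped like what a query on `system.information_schema.columns` would return, and `build_table_documents` is a hypothetical helper, not a Databricks API. Loading the documents into a vector index is left out.

```python
from collections import defaultdict

# Hypothetical sample rows, shaped like the result of a query such as:
#   SELECT table_catalog, table_schema, table_name, column_name, data_type, comment
#   FROM system.information_schema.columns
SAMPLE_ROWS = [
    {"table_catalog": "main", "table_schema": "sales", "table_name": "orders",
     "column_name": "order_id", "data_type": "BIGINT", "comment": "Unique order identifier"},
    {"table_catalog": "main", "table_schema": "sales", "table_name": "orders",
     "column_name": "amount", "data_type": "DECIMAL(10,2)", "comment": "Order total in EUR"},
    {"table_catalog": "main", "table_schema": "sales", "table_name": "customers",
     "column_name": "customer_id", "data_type": "BIGINT", "comment": None},
]


def build_table_documents(rows):
    """Group column rows by table and render one text document per table."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["table_catalog"], row["table_schema"], row["table_name"])
        grouped[key].append(row)

    documents = []
    # Sort by fully qualified name so the output order is deterministic.
    for (catalog, schema, table), cols in sorted(grouped.items()):
        lines = [f"Table: {catalog}.{schema}.{table}"]
        for col in cols:
            description = col.get("comment") or "no description"
            lines.append(f"- {col['column_name']} ({col['data_type']}): {description}")
        documents.append({"id": f"{catalog}.{schema}.{table}", "text": "\n".join(lines)})
    return documents


documents = build_table_documents(SAMPLE_ROWS)
for doc in documents:
    print(doc["text"])
    print()
```

Keeping one document per table means a retrieved chunk always carries the table name together with its column descriptions, which is what the agent needs to answer "where do I find X" questions.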

Paolo_Finardi · 02-03-2025

Hi MariuszK,

I can confirm that the RAG architecture on Databricks works very well; we've taken it to production and are very satisfied with the results. As a first step, I'll try creating a Genie Space on the system tables to see if it can help retrieve the correct tables.
Please keep me updated on your project's progress!

Mantsama4 · 02-05-2025

Hi Paolo,

Thanks for sharing your experience! It’s great to hear that the RAG architecture on Databricks is working well in production.

I’m particularly interested in how you approached the chunking strategy for document ingestion into the Vector Search index in Databricks. Could you share some insights on how you tuned the chunking process to optimize retrieval performance? Specifically:

What chunk sizes worked best for your use case?
Did you implement any custom logic for breaking down structured vs. unstructured data?
How did you balance retrieval accuracy with performance?

Looking forward to your thoughts!

Thanks

Mantu

Paolo_Finardi · 02-27-2025

Hi Mantu,

Let me share some insights based on my team's experience with a topic detection problem.
We used the semantic chunker in LangChain with OpenAI's text-embedding-ada-002 model.
For longer texts, we used a different approach that relies on GPT-4 to perform the semantic split.
If you are interested in the topic, I suggest having a look at this Medium article, where my colleague explains the process very clearly: "Transforming Voice of Customer into Actionable Insights with BERTopic and OpenAI on Databricks"
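
For readers unfamiliar with the idea, here is a rough sketch of what semantic chunking does. This is not LangChain's implementation and not our production code: `toy_embed` is a bag-of-words stand-in for a real embedding model such as text-embedding-ada-002, and the fixed `min_similarity` threshold is an assumption chosen for the example. The principle is to start a new chunk wherever two consecutive sentences are not similar enough.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (assumption): word counts instead of a real model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(text, embed=toy_embed, min_similarity=0.2):
    """Split text into sentences and cut where adjacent sentences diverge."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine_similarity(prev_vec, vec) < min_similarity:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]


text = (
    "The delivery was late. The delivery arrived damaged. "
    "Support agents answered quickly. Support agents were friendly."
)
chunks = semantic_chunks(text)
for chunk in chunks:
    print(chunk)
```

With a real embedding model you would pass your own `embed` function and tune the threshold on your data; the value that works for word counts will not carry over.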

I hope this helps. Let me know your thoughts.