Databricks Cheatsheet
Before you run the notebooks, here is a glossary table to get you oriented with some of the fields.
[Author's image: glossary table]
5e. Running the notebooks
1. You can find the first two notebooks in my GitHub repo: https://github.com/lulu3202/databricks_s3_starter/tree/main
2. Connect your notebook to the cluster we created in the previous step
[Author's image: Connect your notebook to the cluster (navigate to the top right corner)]
3. Start running through each notebook to understand the various processes
4. Here is the notebook breakdown structure:
📓 Notebook 1: Traditional Data Engineering
- Loaded CSVs (e.g., support_tickets.csv) from S3
- Converted them to Delta format
- Saved them to S3 as external Delta tables
- Registered these tables under: workspace.smart_support
📘 Tables Created:
💡 Goal:
Structure raw support ticket data for easy analysis, reporting, or ML.
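The four steps above can be sketched in PySpark. This is a minimal sketch, not the repo's exact code: the S3 bucket name is a placeholder, and it assumes it runs inside a Databricks notebook where `spark` is predefined.

```python
# Minimal sketch of Notebook 1's flow (bucket name is a placeholder).
bucket = "s3://your-bucket"  # replace with your S3 bucket

# 1. Load the raw CSV from S3
df = spark.read.csv(f"{bucket}/raw/support_tickets.csv",
                    header=True, inferSchema=True)

# 2. Convert to Delta format and 3. save back to S3 as an external table
delta_path = f"{bucket}/delta/support_tickets"
df.write.format("delta").mode("overwrite").save(delta_path)

# 4. Register the external Delta table under workspace.smart_support
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS workspace.smart_support.support_tickets
    USING DELTA
    LOCATION '{delta_path}'
""")
```

Because the table is registered with an explicit `LOCATION`, it is external: dropping it removes only the metastore entry, not the Delta files in S3.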
📓 Notebook 2: Unstructured Text Processing
- Read text files from S3 (billing_faq.txt, etc.)
- Split into paragraphs/chunks
- Tagged each with source (billing_faq, etc.)
- Saved output as external Delta tables in the schema smart_support
📘 Tables Created:
- bronze_billing_faq
- bronze_product_guide
- bronze_technical_faq
💡 Goal:
Prepare an unstructured knowledge base (for semantic search and RAG).
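The chunk-and-tag step can be sketched in plain Python. The splitting rule here (blank-line-separated paragraphs) and the row shape (`id`, `source`, `content`) are assumptions for illustration; the actual notebook may chunk differently.

```python
def chunk_and_tag(text: str, source: str) -> list[dict]:
    """Split a document into paragraph chunks, tagging each with its source."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"id": f"{source}-{i}", "source": source, "content": p}
        for i, p in enumerate(paragraphs)
    ]

# Example: chunk billing_faq.txt and tag each chunk with "billing_faq"
faq_text = "How do I update billing?\nGo to Settings.\n\nWhen am I charged?\nMonthly."
chunks = chunk_and_tag(faq_text, "billing_faq")
# Each row can then be written out as the bronze_billing_faq Delta table.
```

The `source` tag is what lets a downstream RAG step report which document a retrieved chunk came from.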
[Author's image: Tables created under the smart_support schema]
OPTIONAL NEXT STEPS: 📓 Notebook 3: Setting Up a Simple RAG Workflow
As a preparation step for RAG, the documents comprising the knowledge base will be split into chunks and stored in Delta tables. Each chunk will be embedded using a Databricks-hosted embedding model to generate dense vector representations.
These embeddings, along with metadata such as id, source, and content, will be stored in a Delta table. A Vector Search Index is then created in Databricks to enable fast semantic retrieval of the most relevant chunks based on user queries.
🔍 3 Steps to Set Up Vector Search in Databricks:
1. Create a Vector Search Endpoint
- Use VectorSearchClient() to create an endpoint for indexing and searching embeddings.
2. Serve an Embedding Model
- Deploy the hosted model databricks-bge-large-en (or similar) using Model Serving for embedding document chunks.
3. Create a Vector Search Index
- Use create_delta_sync_index() to link your Delta table with the endpoint and embedding model.
- Enables similarity search on the content column using generated embeddings.
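The three steps above can be sketched with the `databricks-vectorsearch` Python SDK. The endpoint and index names below are placeholders I chose for illustration, and the snippet assumes the `id`/`content` columns from the bronze tables and a running Databricks workspace:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# 1. Create a Vector Search endpoint (name is a placeholder)
client.create_endpoint(name="smart_support_endpoint",
                       endpoint_type="STANDARD")

# 2. The hosted databricks-bge-large-en embedding model is referenced
#    below; Databricks serves it, so no separate deployment code is needed here.

# 3. Create a Delta Sync index linking the Delta table, the endpoint,
#    and the embedding model (index/table names are placeholders)
index = client.create_delta_sync_index(
    endpoint_name="smart_support_endpoint",
    index_name="workspace.smart_support.billing_faq_index",
    source_table_name="workspace.smart_support.bronze_billing_faq",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="databricks-bge-large-en",
)

# Semantic retrieval: find the chunks most similar to a user query
results = index.similarity_search(
    query_text="How do I update my billing details?",
    columns=["id", "source", "content"],
    num_results=3,
)
```

With `pipeline_type="TRIGGERED"`, the index syncs from the Delta table on demand; use `"CONTINUOUS"` if you want new rows indexed automatically.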
Note: You can find notebooks 1 and 2, along with the sample docs, in my GitHub repo (https://github.com/lulu3202/databricks_s3_starter/tree/main)