Databricks is positioning its platform as a full stack for LLM development: data ingestion → feature/embedding pipelines → fine-tuning (Mosaic AI) → evaluation → deployment (Model Serving) → monitoring (Lakehouse Monitoring).
I’m curious about real-world experiences here:
1. Where do teams still rely on external components?
- Vector search engines (Pinecone, Weaviate, Milvus) vs. Databricks Vector Search
- Custom LLM gateways in front of OpenAI, Azure OpenAI, or self-hosted vLLM vs. Databricks Model Serving
- Feature/embedding stores outside Unity Catalog
- CI/CD + model registry workflows (MLflow vs. SageMaker/Vertex pipelines)
- Guardrails (Guardrails AI, Rebuff, LlamaGuard, Azure Content Filters)
Do you still find gaps in scalability, latency, or retrieval quality?
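For concreteness, this is the kind of side-by-side retrieval benchmark I end up hand-rolling today (latency percentiles plus recall@k per backend). It's a rough Python sketch under my own assumptions: `benchmark_retriever` and the stub retriever are hypothetical names of mine, and the Databricks/Pinecone calls mentioned in the comments are assumptions you'd swap in for real clients.

```python
import time
import statistics
from typing import Callable, Sequence

def benchmark_retriever(
    name: str,
    retrieve: Callable[[str, int], list[str]],   # (query, k) -> list of doc ids
    queries: Sequence[str],
    relevant: dict[str, set[str]],               # query -> set of relevant doc ids
    k: int = 5,
) -> dict:
    """Measure latency percentiles and recall@k for one retrieval backend."""
    latencies, recalls = [], []
    for q in queries:
        start = time.perf_counter()
        ids = retrieve(q, k)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
        rel = relevant.get(q, set())
        recalls.append(len(set(ids) & rel) / max(len(rel), 1))
    latencies.sort()
    return {
        "backend": name,
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))],
        "recall_at_k": statistics.mean(recalls),
    }

# Hypothetical adapter; in practice you'd wrap the real clients here, e.g.
# a Databricks Vector Search index similarity search or a Pinecone index query
# (both assumptions to be replaced with whatever client you actually run).
def stub_retrieve(query: str, k: int) -> list[str]:
    return [f"doc-{i}" for i in range(k)]  # stand-in so the harness runs end to end

print(benchmark_retriever(
    "stub",
    stub_retrieve,
    ["what is unity catalog?"],
    {"what is unity catalog?": {"doc-0"}},
))
```

Running the same query set through each backend this way is how I currently compare retrieval quality and tail latency before deciding whether the built-in option is enough.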
2. Does Databricks cover everything needed for enterprise-grade evaluation?
- Automated hallucination scoring
- Context relevancy and retrieval precision/recall
- LLM-as-a-judge evaluation pipelines
- Benchmark reproducibility across model versions
- Multi-dataset evaluation (synthetic + real queries)
Or do you still need external tools like TruLens, Ragas, LangSmith, or DeepEval?
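As a reference point, this is the bare-bones LLM-as-a-judge loop I'd want the platform to own end to end: score each RAG sample for faithfulness and relevance, track parse rate, aggregate per run. Minimal sketch only; `evaluate_rag`, `call_judge`, and the stub judge are hypothetical placeholders, and in practice you'd point `call_judge` at a serving endpoint or an external judge model.

```python
import json
import statistics
from typing import Callable

# Hypothetical judge prompt; schema and scale are my own choice, not a vendor default.
JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Return JSON: {{"faithfulness": 1-5, "relevance": 1-5, "reason": "..."}}"""

def evaluate_rag(
    samples: list[dict],                  # each sample: question, context, answer
    call_judge: Callable[[str], str],     # judge prompt -> raw judge completion
) -> dict:
    """Score every sample with a judge LLM and aggregate the results."""
    rows = []
    for s in samples:
        raw = call_judge(JUDGE_PROMPT.format(**s))
        try:
            rows.append(json.loads(raw))
        except json.JSONDecodeError:
            rows.append({"faithfulness": None, "relevance": None, "reason": "unparseable"})
    scored = [r for r in rows if r["faithfulness"] is not None]
    return {
        "n": len(samples),
        "parse_rate": len(scored) / len(samples),
        "mean_faithfulness": statistics.mean(r["faithfulness"] for r in scored),
        "mean_relevance": statistics.mean(r["relevance"] for r in scored),
        "rows": rows,
    }

# Stub judge so the sketch runs; replace with a real model call.
def stub_judge(prompt: str) -> str:
    return json.dumps({"faithfulness": 4, "relevance": 5, "reason": "stub"})

print(evaluate_rag(
    [{"question": "What is Unity Catalog?",
      "context": "Unity Catalog is a governance layer for data and AI assets.",
      "answer": "A governance layer."}],
    stub_judge,
))
```

The question is whether the platform can run this kind of loop reproducibly across model versions and datasets, or whether TruLens/Ragas/LangSmith/DeepEval stay in the stack for it.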
3. How mature is drift + quality monitoring for LLMs?
Drift detection for text/generative models is very different from regression/classification drift:
- Prompt distribution drift
- Embedding drift
- Retrieval degradation over time
- “Silent” quality drops in generative output
- Detecting overfitting after fine-tuning
- Latency/throughput instability for LLM-serving endpoints
Does Lakehouse Monitoring catch these fast enough?
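For example, these are the two checks I currently hand-roll outside the platform: centroid cosine distance between a baseline and a recent window of prompt embeddings, plus a population stability index (PSI) on prompt length. Plain numpy sketch under my own assumptions (function names and the 0.2 PSI threshold are mine), not Lakehouse Monitoring's actual metrics.

```python
import numpy as np

def embedding_drift(baseline: np.ndarray, recent: np.ndarray) -> float:
    """Cosine distance between mean embeddings of two windows (0 = same direction)."""
    b, r = baseline.mean(axis=0), recent.mean(axis=0)
    return 1.0 - float(np.dot(b, r) / (np.linalg.norm(b) * np.linalg.norm(r)))

def psi(baseline: np.ndarray, recent: np.ndarray, bins: int = 10) -> float:
    """Population stability index over a scalar feature (e.g. prompt length)."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf        # catch values outside the baseline range
    p, _ = np.histogram(baseline, bins=edges)
    q, _ = np.histogram(recent, bins=edges)
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))

# Synthetic demo: recent traffic shifts in both embedding space and prompt length.
rng = np.random.default_rng(0)
base_emb = rng.normal(0.0, 1.0, size=(500, 384))
new_emb = rng.normal(0.3, 1.0, size=(500, 384))   # shifted centroid
base_len = rng.normal(40, 10, size=500)
new_len = rng.normal(70, 15, size=500)            # noticeably longer prompts

print("embedding drift:", round(embedding_drift(base_emb, new_emb), 4))
print("prompt-length PSI:", round(psi(base_len, new_len), 4))  # > 0.2 is commonly flagged
```

What I can't easily hand-roll is the "silent" generative quality drop, which is exactly where I'd hope built-in monitoring catches things faster than a weekly eval run.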