<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How realistic is truly end-to-end LLMOps on Databricks? (Get Started Discussions)</title>
    <link>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141338#M11155</link>
    <description>&lt;P&gt;Databricks is positioning the platform as a full stack for LLM development — from data ingestion → feature/embedding pipelines → fine-tuning (Mosaic AI) → evaluation → deployment (Model Serving) → monitoring (Lakehouse Monitoring).&lt;/P&gt;&lt;P&gt;I’m curious about real-world experiences here:&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;1. Where do teams still rely on external components?&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Vector search engines (Pinecone, Weaviate, Milvus) vs. Databricks Vector Search&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Custom LLM gateways (OpenAI, Azure OpenAI, vLLM) vs. Databricks Model Serving&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Feature/embedding stores outside Unity Catalog&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;CI/CD + model registry workflows (MLflow vs. SageMaker/Vertex pipelines)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Guardrails (Guardrails AI, Rebuff, LlamaGuard, Azure Content Filters)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Do you still find gaps in scalability, latency, or retrieval quality?&lt;/P&gt;&lt;HR /&gt;&lt;H4&gt;&lt;STRONG&gt;2. Does Databricks cover everything needed for &lt;EM&gt;enterprise-grade evaluation&lt;/EM&gt;?&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Automated hallucination scoring&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Context relevancy and retrieval precision/recall&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;LLM-as-a-judge evaluation pipelines&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Benchmark reproducibility across model versions&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Multi-dataset evaluation (synthetic + real queries)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Or do you still need external tools like &lt;STRONG&gt;TruLens, Ragas, LangSmith, or DeepEval&lt;/STRONG&gt;?&lt;/P&gt;&lt;HR /&gt;&lt;H4&gt;&lt;STRONG&gt;3. 
How mature is drift + quality monitoring for LLMs?&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Text-based model drift detection is very different from regression/classification drift:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Prompt distribution drift&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Embedding drift&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Retrieval degradation over time&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;“Silent” quality drops in generative output&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Detecting overfitting after fine-tuning&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Latency/throughput instability for LLM-serving endpoints&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Does Lakehouse Monitoring catch these fast enough?&lt;/P&gt;</description>
    <pubDate>Sun, 07 Dec 2025 06:47:30 GMT</pubDate>
    <dc:creator>Poorva21</dc:creator>
    <dc:date>2025-12-07T06:47:30Z</dc:date>
    <item>
      <title>How realistic is truly end-to-end LLMOps on Databricks?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141338#M11155</link>
      <description>&lt;P&gt;Databricks is positioning the platform as a full stack for LLM development — from data ingestion → feature/embedding pipelines → fine-tuning (Mosaic AI) → evaluation → deployment (Model Serving) → monitoring (Lakehouse Monitoring).&lt;/P&gt;&lt;P&gt;I’m curious about real-world experiences here:&lt;/P&gt;&lt;H4&gt;&lt;STRONG&gt;1. Where do teams still rely on external components?&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Vector search engines (Pinecone, Weaviate, Milvus) vs. Databricks Vector Search&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Custom LLM gateways (OpenAI, Azure OpenAI, vLLM) vs. Databricks Model Serving&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Feature/embedding stores outside Unity Catalog&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;CI/CD + model registry workflows (MLflow vs. SageMaker/Vertex pipelines)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Guardrails (Guardrails AI, Rebuff, LlamaGuard, Azure Content Filters)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Do you still find gaps in scalability, latency, or retrieval quality?&lt;/P&gt;&lt;HR /&gt;&lt;H4&gt;&lt;STRONG&gt;2. Does Databricks cover everything needed for &lt;EM&gt;enterprise-grade evaluation&lt;/EM&gt;?&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Automated hallucination scoring&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Context relevancy and retrieval precision/recall&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;LLM-as-a-judge evaluation pipelines&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Benchmark reproducibility across model versions&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Multi-dataset evaluation (synthetic + real queries)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Or do you still need external tools like &lt;STRONG&gt;TruLens, Ragas, LangSmith, or DeepEval&lt;/STRONG&gt;?&lt;/P&gt;&lt;HR /&gt;&lt;H4&gt;&lt;STRONG&gt;3. 
How mature is drift + quality monitoring for LLMs?&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;Text-based model drift detection is very different from regression/classification drift:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Prompt distribution drift&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Embedding drift&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Retrieval degradation over time&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;“Silent” quality drops in generative output&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Detecting overfitting after fine-tuning&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Latency/throughput instability for LLM-serving endpoints&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Does Lakehouse Monitoring catch these fast enough?&lt;/P&gt;</description>
      <pubDate>Sun, 07 Dec 2025 06:47:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141338#M11155</guid>
      <dc:creator>Poorva21</dc:creator>
      <dc:date>2025-12-07T06:47:30Z</dc:date>
    </item>
    <item>
      <title>Re: How realistic is truly end-to-end LLMOps on Databricks?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141348#M11157</link>
      <description>&lt;P&gt;&lt;FONT size="3"&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/199898"&gt;@Poorva21&lt;/a&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;In several projects I’ve seen and worked on, Databricks gets you &lt;EM&gt;very close&lt;/EM&gt; to a full end-to-end LLMOps platform, but not completely. It realistically covers most of the lifecycle, but in real production setups you still complement it with external pieces, mainly around CI/CD, guardrails, evaluation structure, and drift.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;FONT size="3"&gt;&lt;STRONG&gt;1. External components still appear&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;The most common external piece isn’t vector search or exotic engines, it’s actually the &lt;STRONG&gt;CI/CD layer (Jenkins or Azure DevOps)&lt;/STRONG&gt; plus the microservice that wraps the model.&amp;nbsp;&lt;/FONT&gt;&lt;FONT size="3"&gt;In practice, outside Databricks you still need:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;a microservice to handle requests (validations, retries, guardrails),&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;asynchronous orchestration (queues like RabbitMQ),&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;error handling and logging,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;rate limits, timeouts, business logic,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;CI/CD pipelines to deploy code + models.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;Databricks Model Serving is solid, but:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;cold starts from scaled-to-zero matter,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;latency isn’t always as stable as a dedicated gateway,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;handling adapters/LoRA weights sometimes needs manual cleanup or 
structure.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;So even if the model lives inside Databricks, the &lt;EM&gt;application layer&lt;/EM&gt; usually lives outside.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;FONT size="3"&gt;&lt;STRONG&gt;2. Evaluation: Databricks has the tools, but teams often don’t unlock the full potential&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;This is the area where I see the biggest gap, and it’s not always a platform limitation.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;Databricks gives you:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;UC-governed evaluation datasets&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Mosaic AI evaluation&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;MLflow model versions + lineage&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Serving logs&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Lakehouse Monitoring with custom metrics&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Dashboards / notebooks for analysis&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;But in reality, many teams (including us sometimes) end up using:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;LangSmith,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;simple notebook-based evaluations,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;custom scripts,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;ad-hoc datasets,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;manual LLM-as-a-judge workflows.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;And honestly, I think this is often because &lt;STRONG&gt;we don’t fully know how far you can push evaluation inside Databricks.&amp;nbsp;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;FONT size="3"&gt;If we standardized the workflow (dataset → evaluation job → metrics → UC → dashboard), Databricks would cover much more than we 
give it credit for.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;So yes, LangSmith is nice, but Databricks already has many of the pieces; we just don’t always leverage them.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;U&gt;&lt;FONT size="3"&gt;&lt;STRONG&gt;3. Drift &amp;amp; quality monitoring: strong foundation, but LLM drift is still early&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/U&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;Lakehouse Monitoring shines for tabular ML and works very well when you plug in custom metrics (e.g., Hellinger distance in our case).&amp;nbsp;&lt;/FONT&gt;&lt;FONT size="3"&gt;But LLM-specific drift still requires custom work:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;prompt distribution drift&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;embedding drift&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;retrieval degradation&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;silent quality drops&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;cold start or throughput instability&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;overfitting after fine-tuning&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;These aren’t “detectable out of the box” yet, but Databricks makes it easy to store metrics, monitor them, and alert; you just need to define what matters for your use case.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;So, for me, Databricks is the closest thing to a true end-to-end LLMOps platform today.&amp;nbsp;&lt;/FONT&gt;&lt;FONT size="3"&gt;It really simplifies fine-tuning, serving, monitoring, data lineage, governance, and CI/CD integration.&amp;nbsp;&lt;/FONT&gt;&lt;FONT size="3"&gt;But in practice you still complement it with:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Jenkins or Azure DevOps for CI/CD,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;a microservice layer for orchestration + guardrails,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;external tools for evaluation when teams don’t fully leverage Databricks’ native capabilities,&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;and custom drift metrics for LLM-specific behaviors.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT size="3"&gt;It’s not 100% end-to-end yet, but it gets you much closer than anything else I’ve seen so far.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;Hope it helps!&amp;nbsp;&lt;BR /&gt;&lt;SPAN&gt;Gema&amp;nbsp;&lt;/SPAN&gt;&lt;span class="lia-unicode-emoji" title=":woman_technologist:"&gt;👩‍💻&lt;/span&gt;&lt;BR /&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 07 Dec 2025 17:03:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141348#M11157</guid>
      <dc:creator>Gecofer</dc:creator>
      <dc:date>2025-12-07T17:03:07Z</dc:date>
    </item>
    <item>
      <title>Re: How realistic is truly end-to-end LLMOps on Databricks?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141690#M11184</link>
      <description>&lt;P&gt;Thank You&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135464"&gt;@Gecofer&lt;/a&gt;&amp;nbsp;for taking the time to share such a clear, experience-backed breakdown of where Databricks shines and where real-world LLM Ops architectures still need supporting components. Your explanation was incredibly practical and resonates a lot with what we see in production environments as well.&lt;/P&gt;&lt;P&gt;I especially appreciate how you distinguished between platform limitations and team adoption gaps—your points on evaluation workflows and drift monitoring hit the mark. The way you framed Databricks as “the closest thing to a true end-to-end LLMOps platform, but not entirely standalone” is probably the most accurate summary I’ve seen.&lt;/P&gt;&lt;P&gt;Thanks again for the detailed insights.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Dec 2025 16:48:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/how-realistic-is-truly-end-to-end-llmops-on-databricks/m-p/141690#M11184</guid>
      <dc:creator>Poorva21</dc:creator>
      <dc:date>2025-12-11T16:48:52Z</dc:date>
    </item>
  </channel>
</rss>

