<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data Observability in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/149378#M53081</link>
    <description>&lt;P data-unlink="true"&gt;You can start by tracking &lt;SPAN&gt;system health and determining root causes and anomalies based on information from tables and logs. I would start with this blog post:&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-unlink="true"&gt;&lt;SPAN&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/databricks-observability-using-grafana-and-prometheus/ba-p/96849" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/databricks-observability-using-grafana-and-prometheus/ba-p/96849&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-unlink="true"&gt;&lt;SPAN&gt;Then add information from the system logs to these.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 26 Feb 2026 14:23:01 GMT</pubDate>
    <dc:creator>balajij8</dc:creator>
    <dc:date>2026-02-26T14:23:01Z</dc:date>
    <item>
      <title>Data Observability in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/149369#M53080</link>
      <description>&lt;P&gt;This is a general question, more on the design side of observability.&lt;/P&gt;&lt;P&gt;There are 500+ data pipelines built in the healthcare domain using Azure and AWS Databricks.&lt;/P&gt;&lt;P&gt;Could someone please help me design a system to:&lt;/P&gt;&lt;P&gt;1. Continuously track system health and behavior using telemetry data across multiple platforms.&lt;/P&gt;&lt;P&gt;2. Determine the root cause and scope of issues.&lt;/P&gt;&lt;P&gt;3. Identify anomalies and failures.&lt;/P&gt;&lt;P&gt;I got something from ChatGPT, but if anyone has implemented a similar solution in production, I would highly appreciate it if they could share their learnings, such as the tech stack used, with a very high-level description of how each component is used.&lt;/P&gt;&lt;P&gt;Thank you so much.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Feb 2026 11:32:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/149369#M53080</guid>
      <dc:creator>Datalight</dc:creator>
      <dc:date>2026-02-26T11:32:44Z</dc:date>
    </item>
    <item>
      <title>Re: Data Observability in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/149378#M53081</link>
      <description>&lt;P data-unlink="true"&gt;You can start by tracking &lt;SPAN&gt;system health and determining root causes and anomalies based on information from tables and logs. I would start with this blog post:&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-unlink="true"&gt;&lt;SPAN&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/databricks-observability-using-grafana-and-prometheus/ba-p/96849" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/databricks-observability-using-grafana-and-prometheus/ba-p/96849&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P data-unlink="true"&gt;&lt;SPAN&gt;Then add information from the system logs to these.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Feb 2026 14:23:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/149378#M53081</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-02-26T14:23:01Z</dc:date>
    </item>
    <item>
      <title>Re: Data Observability in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/150084#M53232</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179126"&gt;@Datalight&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach that combines several Databricks-native capabilities. Let me walk through a practical architecture using built-in tools.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;LAYER 1: SYSTEM TABLES -- YOUR OBSERVABILITY FOUNDATION&lt;/P&gt;
&lt;P&gt;Databricks provides a rich set of system tables in the "system" catalog that give you account-wide telemetry out of the box. These are Delta tables you can query with SQL, build dashboards on, and set alerts against. Key tables for your use case:&lt;/P&gt;
&lt;P&gt;- system.lakeflow.job_run_timeline: Start/end times, status for all job runs (pipeline health, SLA tracking)&lt;BR /&gt;- system.lakeflow.job_task_run_timeline: Task-level execution metrics (root cause: which task failed?)&lt;BR /&gt;- system.billing.usage: DBU consumption across all workloads (cost observability, anomaly detection)&lt;BR /&gt;- system.compute.node_timeline: Compute resource utilization (infrastructure health)&lt;BR /&gt;- system.compute.warehouse_events: SQL warehouse lifecycle events (warehouse performance)&lt;BR /&gt;- system.query.history: All queries on SQL warehouses and serverless (slow query detection)&lt;BR /&gt;- system.access.audit: All audit events across workspaces (security, compliance -- critical for healthcare)&lt;BR /&gt;- system.access.table_lineage: Read/write events on Unity Catalog tables (impact analysis)&lt;BR /&gt;- system.access.column_lineage: Column-level read/write operations (fine-grained data flow tracking)&lt;/P&gt;
&lt;P&gt;Key point: Most system tables retain 365 days of history, and billing usage is free to query. Since you are running on both Azure and AWS, system tables give you a unified view across all workspaces attached to the same Unity Catalog metastore.&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/admin/system-tables/" target="_blank"&gt;https://docs.databricks.com/admin/system-tables/&lt;/A&gt;&lt;/P&gt;
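&lt;P&gt;As a rough sketch of how the job run table can feed a pipeline-health view, the snippet below pairs a query string with a small Python helper that turns the returned rows into a failure rate per job. The column names (workspace_id, job_id, run_id, result_state, period_end_time), the set of states counted as failures, and the sample job names are assumptions for illustration, so please check them against the system tables docs before relying on this.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import Counter

# Hypothetical query text. Column names are assumptions to verify against the
# system tables documentation for your account.
FAILED_RUNS_SQL = """
SELECT workspace_id, job_id, run_id, result_state, period_end_time
FROM system.lakeflow.job_run_timeline
WHERE period_end_time >= current_timestamp() - INTERVAL 24 HOURS
  AND result_state IS NOT NULL
"""

# States treated as failures here. This is an assumption; adjust to your policy.
FAILURE_STATES = {"FAILED", "ERROR", "TIMED_OUT"}


def failure_rate_by_job(rows):
    """Return {job_id: failed_runs / total_runs} for an iterable of row dicts."""
    total = Counter()
    failed = Counter()
    for row in rows:
        total[row["job_id"]] += 1
        if row["result_state"] in FAILURE_STATES:
            failed[row["job_id"]] += 1
    return {job_id: failed[job_id] / count for job_id, count in total.items()}


# Sample rows standing in for the result of running FAILED_RUNS_SQL.
sample_rows = [
    {"job_id": "ingest_claims", "result_state": "SUCCEEDED"},
    {"job_id": "ingest_claims", "result_state": "FAILED"},
    {"job_id": "build_reports", "result_state": "SUCCEEDED"},
]
rates = failure_rate_by_job(sample_rows)
```
&lt;/PRE&gt;
&lt;P&gt;In a notebook you would run the query and pass the resulting rows to the helper; the same aggregation could also be done directly in SQL with a GROUP BY.&lt;/P&gt;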
&lt;P&gt;&lt;BR /&gt;LAYER 2: DATA QUALITY WITH LAKEFLOW DECLARATIVE PIPELINES (DLT) EXPECTATIONS&lt;/P&gt;
&lt;P&gt;If your 500+ pipelines use Lakeflow Declarative Pipelines, expectations are your first line of defense for data quality:&lt;/P&gt;
&lt;P&gt;- expect (warn): Records violations in the event log but passes the bad records through -- good for monitoring&lt;BR /&gt;- expect_or_drop: Drops invalid records before they reach the target (the drop count is still logged) -- good for filtering known-bad data&lt;BR /&gt;- expect_or_fail: Stops the pipeline on bad data -- good for critical data quality gates&lt;/P&gt;
&lt;P&gt;Expectation results are written to the pipeline event log, which you can query with SQL to build dashboards tracking data quality trends over time.&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/ldp/expectations" target="_blank"&gt;https://docs.databricks.com/ldp/expectations&lt;/A&gt;&lt;/P&gt;
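&lt;P&gt;To make the difference between the three modes concrete, here is a plain-Python illustration of the warn / drop / fail semantics. It does not use the pipelines API (in a real pipeline you would use the expect, expect_or_drop, and expect_or_fail decorators); apply_expectation, ExpectationFailed, and the patient_id rule are hypothetical names made up for this sketch.&lt;/P&gt;
&lt;PRE&gt;
```python
class ExpectationFailed(Exception):
    """Raised when a fail-mode expectation sees an invalid record."""


def apply_expectation(records, predicate, mode):
    """Apply one rule to a list of records.

    mode is "warn", "drop", or "fail", mirroring expect, expect_or_drop,
    and expect_or_fail. Returns (kept_records, violation_count).
    """
    if mode not in ("warn", "drop", "fail"):
        raise ValueError(f"unknown mode: {mode}")
    bad = [r for r in records if not predicate(r)]
    if mode == "fail" and bad:
        # Fail mode: stop processing as soon as any record is invalid.
        raise ExpectationFailed(f"{len(bad)} invalid record(s)")
    if mode == "drop":
        # Drop mode: keep only valid records, but still report the count.
        return [r for r in records if predicate(r)], len(bad)
    # Warn mode (or fail mode with no violations): keep everything.
    return list(records), len(bad)


def has_patient_id(record):
    """Hypothetical rule: every visit must carry a patient_id."""
    return record.get("patient_id") is not None


visits = [{"patient_id": 1}, {"patient_id": None}, {"patient_id": 3}]

warned, warn_count = apply_expectation(visits, has_patient_id, "warn")
dropped, drop_count = apply_expectation(visits, has_patient_id, "drop")
```
&lt;/PRE&gt;
&lt;P&gt;With the same input, warn keeps all three records and counts one violation, drop keeps two, and fail raises instead of returning.&lt;/P&gt;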
&lt;P&gt;&lt;BR /&gt;LAYER 3: LAKEHOUSE MONITORING (DATA PROFILING AND ANOMALY DETECTION)&lt;/P&gt;
&lt;P&gt;Lakehouse Monitoring (Unity Catalog monitors) automatically computes data profiling metrics and detects drift/anomalies on your tables. It supports three analysis modes:&lt;/P&gt;
&lt;P&gt;- Time Series: For timestamp-based data, computes metrics across time windows&lt;BR /&gt;- Snapshot: Profiles the full table on each refresh (up to 4TB)&lt;BR /&gt;- Inference: For ML model monitoring (if applicable in your healthcare domain)&lt;/P&gt;
&lt;P&gt;For each monitored table, it generates:&lt;BR /&gt;1. A profile metrics table (null counts, distributions, statistics)&lt;BR /&gt;2. A drift metrics table (how data changes over time vs. a baseline)&lt;BR /&gt;3. An auto-generated dashboard for visualization&lt;/P&gt;
&lt;P&gt;This is particularly powerful for healthcare data where schema completeness and value distributions matter.&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/lakehouse-monitoring/" target="_blank"&gt;https://docs.databricks.com/en/lakehouse-monitoring/&lt;/A&gt;&lt;/P&gt;
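&lt;P&gt;For intuition on what a drift metric captures, the sketch below compares the share of missing values in one column between a baseline window and a current window. This is only a hand-rolled illustration of the idea, not how Lakehouse Monitoring computes its metrics; the function names and sample values are made up.&lt;/P&gt;
&lt;PRE&gt;
```python
def null_fraction(values):
    """Fraction of entries that are None; 0.0 for an empty column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def null_drift(baseline, current):
    """Absolute change in null fraction between two windows of one column."""
    return abs(null_fraction(current) - null_fraction(baseline))


baseline_window = ["A12", "B07", None, "C33"]  # 1 of 4 values missing
current_window = ["A15", None, None, "D02"]    # 2 of 4 values missing
drift = null_drift(baseline_window, current_window)
```
&lt;/PRE&gt;
&lt;P&gt;A jump like this in a required healthcare field is the kind of change you would want surfaced on a dashboard or alert.&lt;/P&gt;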
&lt;P&gt;&lt;BR /&gt;LAYER 4: UNITY CATALOG DATA LINEAGE FOR ROOT CAUSE ANALYSIS&lt;/P&gt;
&lt;P&gt;For your second requirement (determine root cause and scope), Unity Catalog lineage is essential:&lt;/P&gt;
&lt;P&gt;- Captures lineage at both table and column level across all languages (Python, SQL, Scala)&lt;BR /&gt;- Works cross-workspace, so your Azure and AWS pipelines are all visible (if connected to the same metastore)&lt;BR /&gt;- Visible in the Catalog Explorer UI with interactive upstream/downstream graphs&lt;BR /&gt;- Also available programmatically via system tables (system.access.table_lineage, system.access.column_lineage)&lt;BR /&gt;- Lineage data is retained for 1 year&lt;/P&gt;
&lt;P&gt;Root cause workflow: When a downstream report shows bad data, trace upstream through the lineage graph to find which pipeline and transformation introduced the issue.&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/data-governance/unity-catalog/data-lineage" target="_blank"&gt;https://docs.databricks.com/data-governance/unity-catalog/data-lineage&lt;/A&gt;&lt;/P&gt;
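&lt;P&gt;The upstream trace described above can be sketched as a breadth-first walk over lineage edges. The example assumes you have already pulled (source table, target table) pairs out of system.access.table_lineage (the source_table_full_name and target_table_full_name column names are my assumption, please verify them); the table names below are invented.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import deque


def upstream_tables(edges, start):
    """Return all tables that feed `start`, directly or indirectly.

    edges is an iterable of (source_table, target_table) pairs.
    """
    # Index the edges so each table maps to its direct parents.
    parents = {}
    for source, target in edges:
        parents.setdefault(target, set()).add(source)

    seen = set()
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for parent in parents.get(table, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Invented lineage edges standing in for rows from the lineage system table.
lineage_edges = [
    ("raw.claims", "silver.claims"),
    ("raw.members", "silver.members"),
    ("silver.claims", "gold.claims_report"),
    ("silver.members", "gold.claims_report"),
    ("raw.labs", "silver.labs"),
]
suspects = upstream_tables(lineage_edges, "gold.claims_report")
```
&lt;/PRE&gt;
&lt;P&gt;The resulting set is the list of candidate tables to inspect when the report looks wrong; swapping source and target in the index gives the downstream impact set instead.&lt;/P&gt;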
&lt;P&gt;&lt;BR /&gt;LAYER 5: ALERTING AND NOTIFICATIONS&lt;/P&gt;
&lt;P&gt;To make your observability system proactive (not just dashboards), set up Databricks SQL Alerts:&lt;/P&gt;
&lt;P&gt;- Write SQL queries against system tables, event logs, or your data tables&lt;BR /&gt;- Set threshold conditions (e.g., "alert if job failure rate &amp;gt; 5% in last hour")&lt;BR /&gt;- Schedule them to run at regular intervals&lt;BR /&gt;- Route notifications to Email, Slack, PagerDuty, Microsoft Teams, or Webhooks&lt;/P&gt;
&lt;P&gt;Example alert queries:&lt;BR /&gt;- Pipeline failures: Query system.lakeflow.job_run_timeline for failed runs&lt;BR /&gt;- Data freshness: Check if a table's last update timestamp exceeds your SLA&lt;BR /&gt;- Cost anomalies: Query system.billing.usage for unexpected DBU spikes&lt;BR /&gt;- Data quality: Query expectation results for spikes in failed records&lt;/P&gt;
&lt;P&gt;Docs:&lt;BR /&gt;- SQL Alerts: &lt;A href="https://docs.databricks.com/en/sql/user/alerts/index.html" target="_blank"&gt;https://docs.databricks.com/en/sql/user/alerts/index.html&lt;/A&gt;&lt;BR /&gt;- Notification Destinations: &lt;A href="https://docs.databricks.com/en/admin/workspace-settings/notification-destinations.html" target="_blank"&gt;https://docs.databricks.com/en/admin/workspace-settings/notification-destinations.html&lt;/A&gt;&lt;/P&gt;
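&lt;P&gt;The threshold logic behind two of these alerts (failure rate and data freshness) is simple enough to sketch in a few lines. In practice the condition would live in the alert's SQL query and threshold settings; the function names, the 6-hour SLA, and the sample numbers here are illustrative assumptions.&lt;/P&gt;
&lt;PRE&gt;
```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, now, sla):
    """True when the time since the last update exceeds the SLA."""
    return now - last_updated > sla


def should_alert(failed_runs, total_runs, threshold=0.05):
    """True when the failure rate is above the threshold (5% by default)."""
    if total_runs == 0:
        # No runs in the window: nothing to alert on for failure rate.
        return False
    return failed_runs / total_runs > threshold


now = datetime(2026, 3, 7, 12, 0, tzinfo=timezone.utc)
last_update = datetime(2026, 3, 7, 5, 30, tzinfo=timezone.utc)

stale = is_stale(last_update, now, sla=timedelta(hours=6))
alert = should_alert(failed_runs=3, total_runs=40)
```
&lt;/PRE&gt;
&lt;P&gt;Here the table was last updated 6.5 hours ago against a 6-hour SLA, and 3 failures out of 40 runs is 7.5%, so both checks would fire.&lt;/P&gt;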
&lt;P&gt;&lt;BR /&gt;LAYER 6: METRIC VIEWS FOR STANDARDIZED KPIS&lt;/P&gt;
&lt;P&gt;If you want to define standard data quality or operational KPIs that multiple teams can consume consistently, Metric Views are worth exploring:&lt;/P&gt;
&lt;P&gt;- Define metrics centrally in YAML, registered in Unity Catalog&lt;BR /&gt;- Separate measures from dimensions so teams can slice/dice at query time&lt;BR /&gt;- Integrate with AI/BI dashboards and Genie spaces&lt;BR /&gt;- Support materialization for pre-computed aggregations&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/metric-views/" target="_blank"&gt;https://docs.databricks.com/metric-views/&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;PUTTING IT ALL TOGETHER&lt;/P&gt;
&lt;P&gt;Here is the high-level architecture:&lt;/P&gt;
&lt;P&gt;DATA SOURCES (500+ Pipelines)&lt;BR /&gt;|&lt;BR /&gt;v&lt;BR /&gt;[Layer 1] System Tables (system.lakeflow.*, system.billing.*, system.access.*)&lt;BR /&gt;--&amp;gt; Job/task health, cost, audit, lineage&lt;BR /&gt;[Layer 2] DLT Expectations + Event Logs&lt;BR /&gt;--&amp;gt; Data quality gates, violation tracking&lt;BR /&gt;[Layer 3] Lakehouse Monitoring&lt;BR /&gt;--&amp;gt; Data profiling, drift detection, anomaly detection&lt;BR /&gt;[Layer 4] Unity Catalog Lineage&lt;BR /&gt;--&amp;gt; Root cause tracing, impact analysis&lt;BR /&gt;|&lt;BR /&gt;v&lt;BR /&gt;OBSERVABILITY LAYER&lt;BR /&gt;+--&amp;gt; AI/BI Dashboards (operational views)&lt;BR /&gt;+--&amp;gt; SQL Alerts --&amp;gt; Slack / PagerDuty / Email / Teams&lt;BR /&gt;+--&amp;gt; Metric Views (standardized KPIs)&lt;BR /&gt;+--&amp;gt; (Optional) External tools via webhooks (Grafana, Datadog, etc.)&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;FOR CROSS-PLATFORM (AZURE + AWS)&lt;/P&gt;
&lt;P&gt;Since you run on both clouds:&lt;BR /&gt;- Use Unity Catalog as your governance layer across both -- system tables and lineage will unify your view&lt;BR /&gt;- If the workspaces are in separate accounts, consider using Delta Sharing to share observability data between them&lt;BR /&gt;- The system tables approach works identically on both Azure and AWS&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;QUICK START RECOMMENDATION&lt;/P&gt;
&lt;P&gt;If I were starting from scratch with 500+ pipelines, here is the order I would prioritize:&lt;/P&gt;
&lt;P&gt;1. Enable system tables -- immediate visibility with zero pipeline changes&lt;BR /&gt;2. Set up SQL alerts on job failures and data freshness -- proactive notifications&lt;BR /&gt;3. Add DLT expectations to critical pipelines -- data quality gates&lt;BR /&gt;4. Enable Lakehouse Monitoring on your most important tables -- drift detection&lt;BR /&gt;5. Build dashboards combining system tables + event logs for a unified ops view&lt;BR /&gt;6. Define Metric Views for standardized health KPIs across teams&lt;/P&gt;
&lt;P&gt;Hope this helps! Happy to go deeper on any specific layer.&lt;/P&gt;
&lt;P&gt;* I used an agent system I built to research and draft this reply, based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;</description>
      <pubDate>Sat, 07 Mar 2026 20:12:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-observability-in-databricks/m-p/150084#M53232</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-07T20:12:46Z</dc:date>
    </item>
  </channel>
</rss>

