Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data Observability in Databricks

Datalight
Contributor

This is a very general question, more on the design side of observability.

There are 500+ data pipelines built in the healthcare domain using Azure and AWS Databricks.

Could someone please help me design a system that can:

1. Continuously track system health and behavior using telemetry data across multiple platforms.

2. Determine an issue's root cause and scope.

3. Identify anomalies and failures.

I got something from ChatGPT, but if anyone has implemented a similar solution in production, I would highly appreciate it if they could share their learnings: the tech stack used, with a very high-level note on the role of each component.

Thank you so much.

2 REPLIES

balajij8
Contributor

You can start by tracking system health, then determine root causes and anomalies from the information in system tables and logs. I would start with this blog:

https://community.databricks.com/t5/technical-blog/databricks-observability-using-grafana-and-promet... 

Then layer information from the system logs on top of that.

SteveOstrowski
Databricks Employee

Hi @Datalight,

Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach built on Databricks-native capabilities. Let me walk through a practical architecture using the built-in tools.


LAYER 1: SYSTEM TABLES -- YOUR OBSERVABILITY FOUNDATION

Databricks provides a rich set of system tables in the "system" catalog that give you account-wide telemetry out of the box. These are Delta tables you can query with SQL, build dashboards on, and set alerts against. Key tables for your use case:

- system.lakeflow.job_run_timeline: Start/end times, status for all job runs (pipeline health, SLA tracking)
- system.lakeflow.job_task_run_timeline: Task-level execution metrics (root cause: which task failed?)
- system.billing.usage: DBU consumption across all workloads (cost observability, anomaly detection)
- system.compute.node_timeline: Compute resource utilization (infrastructure health)
- system.compute.warehouse_events: SQL warehouse lifecycle events (warehouse performance)
- system.query.history: All queries on SQL warehouses and serverless (slow query detection)
- system.access.audit: All audit events across workspaces (security, compliance -- critical for healthcare)
- system.access.table_lineage: Read/write events on Unity Catalog tables (impact analysis)
- system.access.column_lineage: Column-level read/write operations (fine-grained data flow tracking)

Key point: Most system tables retain 365 days of history, and billing usage is free to query. Since you are running on both Azure and AWS, system tables give you a unified view across all workspaces attached to the same Unity Catalog metastore.

Docs: https://docs.databricks.com/admin/system-tables/
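To make the "pipeline health" idea concrete, here is a minimal Python sketch of the failure-rate check you would typically express as SQL over system.lakeflow.job_run_timeline. The column names result_state and period_end_time follow that system table's schema, but the plain-dict row shape here is just a stand-in for illustration:

```python
from datetime import datetime, timedelta

def job_failure_rate(runs, window_hours=1, now=None):
    """Fraction of job runs that failed within the last `window_hours`.

    `runs` mirrors a few columns of system.lakeflow.job_run_timeline:
    {"job_id": ..., "period_end_time": datetime, "result_state": "SUCCEEDED" | "FAILED" | ...}
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(hours=window_hours)
    # Only consider runs that finished inside the alerting window
    recent = [r for r in runs if r["period_end_time"] >= cutoff]
    if not recent:
        return 0.0
    failed = sum(1 for r in recent if r["result_state"] == "FAILED")
    return failed / len(recent)
```

In production you would compute the same ratio directly in a SQL alert query rather than in driver-side Python; the sketch just shows the condition the alert encodes.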


LAYER 2: DATA QUALITY WITH LAKEFLOW DECLARATIVE PIPELINES (DLT) EXPECTATIONS

If your 500+ pipelines use Lakeflow Declarative Pipelines, expectations are your first line of defense for data quality:

- expect (warn): Logs bad records but passes them through -- good for monitoring
- expect_or_drop: Silently drops invalid records -- good for filtering known-bad data
- expect_or_fail: Stops the pipeline on bad data -- good for critical data quality gates

Expectation results are written to the pipeline event log, which you can query with SQL to build dashboards tracking data quality trends over time.

Docs: https://docs.databricks.com/ldp/expectations
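The three expectation modes can be sketched in plain Python. This is not the DLT API itself (in a pipeline you would use the @dlt.expect, @dlt.expect_or_drop, and @dlt.expect_or_fail decorators); it is just an illustration of the warn/drop/fail semantics:

```python
def apply_expectation(records, predicate, mode="warn"):
    """Mimic DLT expectation semantics outside a pipeline (illustrative only).

    mode="warn": keep all records, count violations (like expect)
    mode="drop": remove violating records (like expect_or_drop)
    mode="fail": raise on any violation (like expect_or_fail)
    Returns (kept_records, violation_count).
    """
    violations = [r for r in records if not predicate(r)]
    if mode == "fail" and violations:
        raise ValueError(f"Expectation failed for {len(violations)} record(s)")
    kept = records if mode == "warn" else [r for r in records if predicate(r)]
    return kept, len(violations)
```

The violation count is what DLT writes to the pipeline event log, which is what you would chart for quality trends.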


LAYER 3: LAKEHOUSE MONITORING (DATA PROFILING AND ANOMALY DETECTION)

Lakehouse Monitoring (Unity Catalog monitors) automatically computes data profiling metrics and detects drift/anomalies on your tables. It supports three analysis modes:

- Time Series: For timestamp-based data, computes metrics across time windows
- Snapshot: Profiles the full table on each refresh (up to 4TB)
- Inference: For ML model monitoring (if applicable in your healthcare domain)

For each monitored table, it generates:
1. A profile metrics table (null counts, distributions, statistics)
2. A drift metrics table (how data changes over time vs. a baseline)
3. An auto-generated dashboard for visualization

This is particularly powerful for healthcare data where schema completeness and value distributions matter.

Docs: https://docs.databricks.com/en/lakehouse-monitoring/
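To give a feel for what "drift vs. a baseline" means, here is a toy Python version of one profile metric (null rate) and a drift check against a baseline window. Lakehouse Monitoring computes many such metrics for you automatically; the threshold and row shape below are assumptions for illustration:

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is None (a profile-style metric)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def null_rate_drift(baseline, current, column, threshold=0.05):
    """Flag drift when the null rate moves more than `threshold` vs. baseline.

    Returns (drifted: bool, delta: float).
    """
    delta = null_rate(current, column) - null_rate(baseline, column)
    return abs(delta) > threshold, delta
```

For healthcare data, a sudden jump in null rate on a clinical identifier column is exactly the kind of silent failure this catches before it reaches a report.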


LAYER 4: UNITY CATALOG DATA LINEAGE FOR ROOT CAUSE ANALYSIS

For your second requirement (determine root cause and scope), Unity Catalog lineage is essential:

- Captures lineage at both table and column level across all languages (Python, SQL, Scala)
- Works cross-workspace, so your Azure and AWS pipelines are all visible (if connected to the same metastore)
- Visible in the Catalog Explorer UI with interactive upstream/downstream graphs
- Also available programmatically via system tables (system.access.table_lineage, system.access.column_lineage)
- Lineage data is retained for 1 year

Root cause workflow: When a downstream report shows bad data, trace upstream through the lineage graph to find which pipeline and transformation introduced the issue.

Docs: https://docs.databricks.com/data-governance/unity-catalog/data-lineage
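The upstream-tracing step can be sketched as a small graph walk. The (source, target) edge pairs are the shape you could pull from system.access.table_lineage (simplified here; the real table carries entity types, timestamps, and more):

```python
from collections import deque

def upstream_tables(edges, start):
    """Walk lineage edges backwards from `start` to find all upstream tables.

    `edges` is a list of (source_table, target_table) pairs.
    Returns the set of tables feeding `start`, directly or transitively.
    """
    parents = {}
    for src, tgt in edges:
        parents.setdefault(tgt, set()).add(src)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for p in parents.get(node, ()):  # one hop upstream
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen
```

Joining that upstream set against recent failed runs in system.lakeflow.job_run_timeline narrows root-cause candidates quickly.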


LAYER 5: ALERTING AND NOTIFICATIONS

To make your observability system proactive (not just dashboards), set up Databricks SQL Alerts:

- Write SQL queries against system tables, event logs, or your data tables
- Set threshold conditions (e.g., "alert if job failure rate > 5% in last hour")
- Schedule them to run at regular intervals
- Route notifications to Email, Slack, PagerDuty, Microsoft Teams, or Webhooks

Example alert queries:
- Pipeline failures: Query system.lakeflow.job_run_timeline for failed runs
- Data freshness: Check if a table's last update timestamp exceeds your SLA
- Cost anomalies: Query system.billing.usage for unexpected DBU spikes
- Data quality: Query expectation results for spikes in failed records

Docs:
- SQL Alerts: https://docs.databricks.com/en/sql/user/alerts/index.html
- Notification Destinations: https://docs.databricks.com/en/admin/workspace-settings/notification-destinations.html
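As a concrete example of the "data freshness" alert above, here is the condition in plain Python (in practice you would encode it as the SQL alert's query; timestamps are assumed UTC):

```python
from datetime import datetime, timedelta

def freshness_breached(last_update, sla_hours, now=None):
    """True when a table's last update is older than its freshness SLA."""
    now = now or datetime.utcnow()
    return (now - last_update) > timedelta(hours=sla_hours)
```

Pair this with a notification destination so a breach pages the owning team rather than waiting for someone to open a dashboard.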


LAYER 6: METRIC VIEWS FOR STANDARDIZED KPIS

If you want to define standard data quality or operational KPIs that multiple teams can consume consistently, Metric Views are worth exploring:

- Define metrics centrally in YAML, registered in Unity Catalog
- Separate measures from dimensions so teams can slice/dice at query time
- Integrate with AI/BI dashboards and Genie spaces
- Support materialization for pre-computed aggregations

Docs: https://docs.databricks.com/metric-views/
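As a rough illustration of the YAML shape, a pipeline-health metric view might look like the fragment below. Treat this as a sketch: the table and column names are hypothetical, and the exact YAML schema may vary by release, so check the docs above before using it.

```yaml
# Illustrative metric view definition (names are hypothetical)
version: 0.1
source: main.ops.job_runs
dimensions:
  - name: run_date
    expr: DATE(period_end_time)
measures:
  - name: failure_rate
    expr: AVG(CASE WHEN result_state = 'FAILED' THEN 1 ELSE 0 END)
```

Once registered in Unity Catalog, every team querying failure_rate gets the same definition instead of re-deriving it per dashboard.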


PUTTING IT ALL TOGETHER

Here is the high-level architecture:

DATA SOURCES (500+ Pipelines)
|
v
[Layer 1] System Tables (system.lakeflow.*, system.billing.*, system.access.*)
--> Job/task health, cost, audit, lineage
[Layer 2] DLT Expectations + Event Logs
--> Data quality gates, violation tracking
[Layer 3] Lakehouse Monitoring
--> Data profiling, drift detection, anomaly detection
[Layer 4] Unity Catalog Lineage
--> Root cause tracing, impact analysis
|
v
OBSERVABILITY LAYER
+--> AI/BI Dashboards (operational views)
+--> SQL Alerts --> Slack / PagerDuty / Email / Teams
+--> Metric Views (standardized KPIs)
+--> (Optional) External tools via webhooks (Grafana, Datadog, etc.)


FOR CROSS-PLATFORM (AZURE + AWS)

Since you run on both clouds:
- Use Unity Catalog as your governance layer across both -- system tables and lineage will unify your view
- If the workspaces are in separate accounts, consider using Delta Sharing to share observability data between them
- The system tables approach works identically on both Azure and AWS


QUICK START RECOMMENDATION

If I were starting from scratch with 500+ pipelines, here is the order I would prioritize:

1. Enable system tables -- immediate visibility with zero pipeline changes
2. Set up SQL alerts on job failures and data freshness -- proactive notifications
3. Add DLT expectations to critical pipelines -- data quality gates
4. Enable Lakehouse Monitoring on your most important tables -- drift detection
5. Build dashboards combining system tables + event logs for a unified ops view
6. Define Metric Views for standardized health KPIs across teams

Hope this helps! Happy to go deeper on any specific layer.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.