Hi @Datalight,
Great question, and one that many organizations at your scale face. With 500+ pipelines across both Azure and AWS, you will want a layered observability approach built from Databricks-native capabilities. Let me walk through a practical architecture using built-in tools.
LAYER 1: SYSTEM TABLES -- YOUR OBSERVABILITY FOUNDATION
Databricks provides a rich set of system tables in the "system" catalog that give you account-wide telemetry out of the box. These are Delta tables you can query with SQL, build dashboards on, and set alerts against. Key tables for your use case:
- system.lakeflow.job_run_timeline: Start/end times, status for all job runs (pipeline health, SLA tracking)
- system.lakeflow.job_task_run_timeline: Task-level execution metrics (root cause: which task failed?)
- system.billing.usage: DBU consumption across all workloads (cost observability, anomaly detection)
- system.compute.node_timeline: Compute resource utilization (infrastructure health)
- system.compute.warehouse_events: SQL warehouse lifecycle events (warehouse performance)
- system.query.history: All queries on SQL warehouses and serverless (slow query detection)
- system.access.audit: All audit events across workspaces (security, compliance -- critical for healthcare)
- system.access.table_lineage: Read/write events on Unity Catalog tables (impact analysis)
- system.access.column_lineage: Column-level read/write operations (fine-grained data flow tracking)
Key point: Most system tables retain 365 days of history, and billing usage is free to query. Since you are running on both Azure and AWS, system tables give you a unified view across all workspaces attached to the same Unity Catalog metastore.
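As a concrete starting point, here is a sketch of a health query over job runs. Column names (period_end_time, result_state) reflect the current job_run_timeline schema as I understand it; verify them against your workspace before building alerts on this.

```sql
-- Daily failure counts per job over the last 7 days.
SELECT
  workspace_id,
  job_id,
  DATE(period_end_time) AS run_date,
  COUNT(*) FILTER (WHERE result_state = 'FAILED') AS failed_runs,
  COUNT(*) AS total_runs
FROM system.lakeflow.job_run_timeline
WHERE period_end_time >= CURRENT_DATE - INTERVAL 7 DAYS
  AND result_state IS NOT NULL  -- keep only rows recording a terminal state
GROUP BY workspace_id, job_id, DATE(period_end_time)
ORDER BY failed_runs DESC;
```

The same query shape works for task-level drill-down by swapping in system.lakeflow.job_task_run_timeline.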
Docs: https://docs.databricks.com/admin/system-tables/
LAYER 2: DATA QUALITY WITH LAKEFLOW DECLARATIVE PIPELINES (DLT) EXPECTATIONS
If your 500+ pipelines use Lakeflow Declarative Pipelines, expectations are your first line of defense for data quality:
- expect (warn): Logs bad records but passes them through -- good for monitoring
- expect_or_drop: Drops invalid records (drop counts are still recorded in pipeline metrics) -- good for filtering known-bad data
- expect_or_fail: Stops the pipeline on bad data -- good for critical data quality gates
Expectation results are written to the pipeline event log, which you can query with SQL to build dashboards tracking data quality trends over time.
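For illustration, here is a minimal sketch of the three modes using the SQL syntax for declarative pipelines. The table, column names, and source path are hypothetical; adapt the constraints to your own schemas.

```sql
-- Sketch: one expectation of each mode on a hypothetical healthcare table.
CREATE OR REFRESH STREAMING TABLE patients_clean (
  -- warn: violations are logged but rows pass through
  CONSTRAINT valid_id  EXPECT (patient_id IS NOT NULL),
  -- drop: invalid rows are removed (and counted in pipeline metrics)
  CONSTRAINT valid_dob EXPECT (date_of_birth <= current_date()) ON VIOLATION DROP ROW,
  -- fail: a violation stops the update -- a hard data quality gate
  CONSTRAINT has_code  EXPECT (icd_code IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM read_files('/Volumes/main/raw/patients');
```

A reasonable pattern is warn-mode on most columns for monitoring, with fail-mode reserved for the handful of fields a downstream clinical report cannot tolerate being wrong.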
Docs: https://docs.databricks.com/ldp/expectations
LAYER 3: LAKEHOUSE MONITORING (DATA PROFILING AND ANOMALY DETECTION)
Lakehouse Monitoring (Unity Catalog monitors) automatically computes data profiling metrics and detects drift/anomalies on your tables. It supports three analysis modes:
- Time Series: For timestamp-based data, computes metrics across time windows
- Snapshot: Profiles the full table on each refresh (up to 4TB)
- Inference: For ML model monitoring (if applicable in your healthcare domain)
For each monitored table, it generates:
1. A profile metrics table (null counts, distributions, statistics)
2. A drift metrics table (how data changes over time vs. a baseline)
3. An auto-generated dashboard for visualization
This is particularly powerful for healthcare data where schema completeness and value distributions matter.
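Once a monitor is attached, its output is just another Delta table you can query. The table name below is hypothetical, and the "<table>_profile_metrics" naming plus the window/column_name columns reflect current Lakehouse Monitoring output; confirm the exact schema of the generated table in your catalog.

```sql
-- Inspect recent profiling results for one column of a monitored table.
SELECT *
FROM main.healthcare.claims_profile_metrics  -- hypothetical monitor output table
WHERE column_name = 'diagnosis_code'
ORDER BY window.start DESC
LIMIT 20;
```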
Docs: https://docs.databricks.com/en/lakehouse-monitoring/
LAYER 4: UNITY CATALOG DATA LINEAGE FOR ROOT CAUSE ANALYSIS
For your second requirement (determine root cause and scope), Unity Catalog lineage is essential:
- Captures lineage at both table and column level across all languages (Python, SQL, Scala)
- Works cross-workspace, so your Azure and AWS pipelines are all visible (if connected to the same metastore)
- Visible in the Catalog Explorer UI with interactive upstream/downstream graphs
- Also available programmatically via system tables (system.access.table_lineage, system.access.column_lineage)
- Lineage data is retained for 1 year
Root cause workflow: When a downstream report shows bad data, trace upstream through the lineage graph to find which pipeline and transformation introduced the issue.
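The first step of that workflow can be automated against the lineage system tables. A sketch, assuming the suspect table name is hypothetical and the source_table_full_name / target_table_full_name / entity_type columns match the current table_lineage schema:

```sql
-- Direct upstream tables, and the entities that wrote them, for a suspect table.
SELECT DISTINCT
  source_table_full_name,
  entity_type,  -- e.g. the notebook, job, or pipeline that produced the write
  entity_id
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.healthcare.claims_gold'
  AND event_time >= CURRENT_DATE - INTERVAL 3 DAYS;
```

Re-running this with each upstream table as the new target walks the graph back to the pipeline that introduced the issue.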
Docs: https://docs.databricks.com/data-governance/unity-catalog/data-lineage
LAYER 5: ALERTING AND NOTIFICATIONS
To make your observability system proactive (not just dashboards), set up Databricks SQL Alerts:
- Write SQL queries against system tables, event logs, or your data tables
- Set threshold conditions (e.g., "alert if job failure rate > 5% in last hour")
- Schedule them to run at regular intervals
- Route notifications to Email, Slack, PagerDuty, Microsoft Teams, or Webhooks
Example alert queries:
- Pipeline failures: Query system.lakeflow.job_run_timeline for failed runs
- Data freshness: Check if a table's last update timestamp exceeds your SLA
- Cost anomalies: Query system.billing.usage for unexpected DBU spikes
- Data quality: Query expectation results for spikes in failed records
Docs:
- SQL Alerts: https://docs.databricks.com/en/sql/user/alerts/index.html
- Notification Destinations: https://docs.databricks.com/en/admin/workspace-settings/notification-destinations.html
LAYER 6: METRIC VIEWS FOR STANDARDIZED KPIS
If you want to define standard data quality or operational KPIs that multiple teams can consume consistently, Metric Views are worth exploring:
- Define metrics centrally in YAML, registered in Unity Catalog
- Separate measures from dimensions so teams can slice/dice at query time
- Integrate with AI/BI dashboards and Genie spaces
- Support materialization for pre-computed aggregations
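For a flavor of what a centrally defined KPI might look like, here is a hedged YAML sketch of a pipeline-health metric view over the job run system table. The exact YAML field names are an assumption on my part; check the Metric Views spec in the docs before using this.

```yaml
# Hypothetical metric view: pipeline-health KPIs, sliceable by workspace and date.
version: 0.1
source: system.lakeflow.job_run_timeline
dimensions:
  - name: workspace
    expr: workspace_id
  - name: run_date
    expr: DATE(period_end_time)
measures:
  - name: total_runs
    expr: COUNT(*)
  - name: failed_runs
    expr: COUNT_IF(result_state = 'FAILED')
```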
Docs: https://docs.databricks.com/metric-views/
PUTTING IT ALL TOGETHER
Here is the high-level architecture:
DATA SOURCES (500+ Pipelines)
|
v
[Layer 1] System Tables (system.lakeflow.*, system.billing.*, system.access.*)
--> Job/task health, cost, audit, lineage
[Layer 2] DLT Expectations + Event Logs
--> Data quality gates, violation tracking
[Layer 3] Lakehouse Monitoring
--> Data profiling, drift detection, anomaly detection
[Layer 4] Unity Catalog Lineage
--> Root cause tracing, impact analysis
|
v
OBSERVABILITY LAYER
+--> AI/BI Dashboards (operational views)
+--> SQL Alerts --> Slack / PagerDuty / Email / Teams
+--> Metric Views (standardized KPIs)
+--> (Optional) External tools via webhooks (Grafana, Datadog, etc.)
FOR CROSS-PLATFORM (AZURE + AWS)
Since you run on both clouds:
- Use Unity Catalog as your governance layer across both -- system tables and lineage will unify your view
- If the workspaces are in separate accounts, consider using Delta Sharing to share observability data between them
- The system tables approach works identically on both Azure and AWS
QUICK START RECOMMENDATION
If I were starting from scratch with 500+ pipelines, here is the order I would prioritize:
1. Enable system tables -- immediate visibility with zero pipeline changes
2. Set up SQL alerts on job failures and data freshness -- proactive notifications
3. Add DLT expectations to critical pipelines -- data quality gates
4. Enable Lakehouse Monitoring on your most important tables -- drift detection
5. Build dashboards combining system tables + event logs for a unified ops view
6. Define Metric Views for standardized health KPIs across teams
Hope this helps! Happy to go deeper on any specific layer.
* This reply was drafted by an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update replies when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.