Observability and monitoring across multiple workspaces (both job clusters and serverless compute)
03-14-2025 08:56 AM
Hi all,
What are the best options available today for observability and monitoring of Databricks jobs across all workspaces? We have 100s of workspaces, and it is hard to keep track of failed and succeeded jobs.
We tried using:
1. Team webhook to notify ourselves of any errors, but it is not very scalable.
2. Grafana and Datadog, but they rely on init scripts, which are no longer an option on serverless compute.
3. System tables (compute and job timeline), but they cannot show resource usage metrics.
4. Databricks Workflow UI: it is limited to one workspace, so it is not scalable.
What we want to have:
1. An overview of failed and succeeded jobs across all workspaces.
2. Failure alerts with easy navigation to the application logs.
3. Email alerts (good to have).
4. Support for serverless compute.
Thanks in advance!
Best regards,
sunny
03-15-2025 08:46 PM
Hi,
How are you doing today? As I understand it, you need a centralized observability solution that works across multiple Databricks workspaces, supports serverless compute, and provides alerts and detailed logs.

Since webhooks and Databricks system tables have limitations, you might consider Databricks audit logs plus a centralized monitoring dashboard. Audit logs (enabled via AWS CloudTrail or Azure Monitor) can capture job status across all workspaces, and you can process them with a Lakehouse approach: store the logs in a Delta table and query them with Databricks SQL.

For real-time monitoring, you could set up a Databricks job that periodically aggregates job statuses and pushes them to a tool like Prometheus, Grafana, or a custom dashboard. For alerts, consider Databricks Alerts (SQL alerts), CloudWatch (AWS), or Azure Monitor with email notifications.

If you want deeper visibility, third-party tools like Monte Carlo, Acceldata, or Unravel offer better observability and cost insights. Let me know if you want help setting up one of these solutions!
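As a rough illustration of the "periodically aggregate job statuses" idea, here is a minimal Python sketch. It is an assumption-laden example, not a tested setup: the query text assumes the jobs system table is available as `system.lakeflow.job_run_timeline` with `workspace_id` and `result_state` columns (please verify the names in your own account), and `summarize_runs` and `failed_workspaces` are hypothetical helper names. The rows are plain dicts, so the aggregation can be tried without a cluster.

```python
from collections import Counter, defaultdict

# Assumed query text; the table and column names are assumptions to verify
# against the system tables enabled in your own account.
RECENT_RUNS_SQL = """
SELECT workspace_id, job_id, run_id, result_state, period_end_time
FROM system.lakeflow.job_run_timeline
WHERE result_state IS NOT NULL
  AND period_end_time >= current_timestamp() - INTERVAL 1 DAY
"""


def summarize_runs(rows):
    """Count run outcomes per workspace from an iterable of row dicts."""
    summary = defaultdict(Counter)
    for row in rows:
        summary[row["workspace_id"]][row["result_state"]] += 1
    return {ws: dict(counts) for ws, counts in summary.items()}


def failed_workspaces(summary, failure_states=("FAILED", "ERROR", "TIMED_OUT")):
    """Return the sorted workspace ids with at least one run in a failure state."""
    return sorted(
        ws
        for ws, counts in summary.items()
        if any(counts.get(state, 0) > 0 for state in failure_states)
    )
```

The per-workspace summary could feed a dashboard, and the list from `failed_workspaces` could drive an alert. Which states count as failures is a choice to adjust for your own jobs.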
Regards,
Brahma
3 weeks ago
Hey Brahmareddy, thanks so much for responding. Sorry, I forgot to mention that we are on Azure. Let's go through these one by one.
1. Audit logs (Azure Monitor): AFAIK this requires init scripts and a JAR build, which are not supported on serverless. Or is that not the case?
2. Scheduling a job to push logs: this does not sound scalable for us, as we would need to run the job in every workspace, maintain each one, and then also monitor those jobs 😞
3. Monte Carlo, Acceldata, or Unravel: these sound interesting, and if they support serverless that would be awesome.
Thanks a lot. Looking forward to your reply.
3 weeks ago
Hi sparkycloud,
Thanks for your reply. You're right to be thinking about these limitations, especially with serverless.

For audit logs on Azure, the older method with init scripts and JARs won't work on serverless. You can still use Azure Monitor's diagnostic settings at the workspace level to push logs to Log Analytics, Event Hub, or Storage. That works fine even with serverless, since it's handled outside the cluster.

As for scheduling jobs across workspaces, I agree: it's not scalable to maintain jobs in every workspace just to collect logs. You could script it using APIs from a central place, but it's still overhead.

Tools like Monte Carlo, Acceldata, or Unravel are definitely worth looking into, and yes, many of them support serverless and Unity Catalog, since they work through metadata or APIs rather than depending on your cluster type. If you want less operational headache and more visibility across workspaces, those tools are a solid long-term bet. Let me know if you want help picking or testing one!
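To make the "script it using APIs from a central place" option more concrete, here is a minimal Python sketch, assuming the Jobs API 2.1 `runs/list` endpoint and one access token per workspace. The helper names (`runs_list_url`, `fetch_runs`, `failed_runs`, `collect_failures`) and the host-to-token mapping are hypothetical, and the sketch reads only the first page of results, so treat it as a starting point rather than a finished collector.

```python
import json
import urllib.parse
import urllib.request


def runs_list_url(host, limit=20):
    """Build the Jobs API 2.1 runs/list URL for one workspace host."""
    query = urllib.parse.urlencode({"completed_only": "true", "limit": limit})
    return f"{host.rstrip('/')}/api/2.1/jobs/runs/list?{query}"


def fetch_runs(host, token):
    """Call runs/list for one workspace and return the decoded JSON payload."""
    request = urllib.request.Request(
        runs_list_url(host), headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def failed_runs(payload):
    """Keep only runs whose result_state is FAILED, with a link to the run page."""
    return [
        {
            "job_id": run.get("job_id"),
            "run_id": run.get("run_id"),
            "url": run.get("run_page_url"),
        }
        for run in payload.get("runs", [])
        if run.get("state", {}).get("result_state") == "FAILED"
    ]


def collect_failures(workspaces, fetch=fetch_runs):
    """Map each workspace host to its failed runs.

    `workspaces` is a hypothetical dict of host -> token; `fetch` can be
    swapped for a stub so the logic can be tried without network access.
    """
    return {host: failed_runs(fetch(host, token)) for host, token in workspaces.items()}
```

The run page URL in each entry gives a direct link from an alert to the run. With many workspaces you would still need to manage one credential per workspace, which is part of the overhead mentioned above.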
Regards,
Brahma

