Hi everyone,
I’m running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines are running smoothly, I’ve noticed that a few of them occasionally get “stuck” or hang for several hours without raising any errors. This behavior results in data loss, which, as you can imagine, is a significant issue.
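For context, the launcher is roughly the following (heavily simplified; the pipeline list, table names, and checkpoint paths are placeholders, not our real configuration):

```python
from concurrent.futures import ThreadPoolExecutor

# `spark` is the SparkSession that Databricks provides in the job context.

# Placeholder pipeline list; in reality each entry maps to a different source and sink.
PIPELINES = ["orders", "payments", "clicks"]

def run_pipeline(name: str) -> None:
    # Each pipeline is a long-running Structured Streaming query.
    query = (
        spark.readStream.table(f"bronze.{name}")              # illustrative source
        .writeStream
        .option("checkpointLocation", f"/chk/{name}")          # illustrative path
        .toTable(f"silver.{name}")                             # illustrative sink
    )
    # Blocks until the query stops or fails; a silent hang never returns here.
    query.awaitTermination()

# One worker thread per pipeline, all inside a single Databricks job.
with ThreadPoolExecutor(max_workers=len(PIPELINES)) as pool:
    futures = [pool.submit(run_pipeline, p) for p in PIPELINES]
    # result() re-raises worker exceptions, but a hung query raises nothing,
    # so this wait can sit for hours with no signal at all.
    for f in futures:
        f.result()
```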
Here’s the situation in more detail:
Inconsistent Behavior:
Most pipelines function as expected, which makes it difficult to pinpoint which pipeline has hung, and when, until the resulting data loss shows up.
Detection Challenges:
Because a hung pipeline doesn’t raise an exception and the rest keep running normally, our usual error-handling and retry mechanisms don’t catch the “frozen” state in time.
Idle Event Monitoring:
I’m particularly interested in using Databricks’ idle event management feature to proactively detect pipelines that have gone idle or are hanging.
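In the meantime, here is the kind of driver-side watchdog I’ve been prototyping. It assumes the pipelines are Structured Streaming queries and simply flags any query whose last progress event is older than a threshold; the 15-minute threshold and the alerting hook are placeholders:

```python
import time
from datetime import datetime, timezone

STALL_THRESHOLD_S = 15 * 60   # treat "no progress for 15 minutes" as a stall (arbitrary)
POLL_INTERVAL_S = 60

def is_stalled(query) -> bool:
    """True if the query has not reported a progress event recently."""
    progress = query.lastProgress            # dict of the most recent progress, or None
    if progress is None:
        return False                         # never progressed yet; handled separately
    last_ts = datetime.strptime(progress["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    age = (datetime.now(timezone.utc) - last_ts.replace(tzinfo=timezone.utc)).total_seconds()
    return age > STALL_THRESHOLD_S

while True:
    # spark.streams.active lists every streaming query started on this cluster.
    for q in spark.streams.active:
        if is_stalled(q):
            print(f"Pipeline {q.name or q.id} looks stalled")   # alerting hook goes here
    time.sleep(POLL_INTERVAL_S)
```

I’d much rather lean on something built into the platform than keep rolling my own polling loop, hence the questions below.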
Could anyone share insights or best practices on the following points?
Configuration and Activation:
How can I configure the idle event monitoring feature on Databricks specifically for real-time pipelines?
Are there particular settings or parameters that are crucial for accurately identifying a hanging pipeline?
Monitoring and Alerting:
What’s a good way to surface an alert (dashboard, email, webhook, etc.) as soon as a pipeline is flagged as idle, rather than discovering the gap hours later?
Thread Pool Considerations:
Could using a thread pool for managing multiple pipelines introduce certain risks that lead to these hanging issues?
Any best practices on resource allocation or per-pipeline timeouts when working with thread pools to mitigate such risks?
Integrating Auto-Restart Mechanisms:
Once a hung pipeline is detected, is there a clean way to stop and relaunch just that pipeline without restarting the whole job? A rough sketch of what I have in mind follows.
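Roughly, the restart I’m considering is to stop the stalled query and resubmit its pipeline function to the same pool; `run_pipeline` and `pool` below are the pieces from my launcher snippet above, and everything else is illustrative:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from pyspark.sql.streaming import StreamingQuery

def restart_pipeline(query: StreamingQuery, pool: ThreadPoolExecutor, name: str) -> Future:
    """Stop a stalled query and resubmit its pipeline to the thread pool."""
    try:
        # stop() signals the query to terminate, which should unblock the
        # awaitTermination() call in the worker thread that launched it.
        query.stop()
    except Exception as exc:
        # A truly wedged query might not even respond to stop(); log and move on.
        print(f"stop() failed for {name}: {exc}")
    # run_pipeline is the placeholder launcher function from the first snippet.
    return pool.submit(run_pipeline, name)
```

I’m not sure how reliable stop() is against a genuinely wedged query, which is part of why I’m asking.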
I’ve already tried relying on Databricks’ default error-handling mechanisms, but the intermittent hanging issue remains unresolved. I’m eager to hear your thoughts on leveraging idle event management effectively and any other strategies or configurations that have worked for you.
Thanks in advance for your insights!
Regards,
Hung Nguyen