Hi everyone,
I’m running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines are running smoothly, I’ve noticed that a few of them occasionally get “stuck” or hang for several hours without raising any errors. This behavior results in data loss, which, as you can imagine, is a significant issue.
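For context, the launcher is roughly the following (heavily simplified; the pipeline list, table names, and checkpoint paths are placeholders, not our real configuration):

```python
from concurrent.futures import ThreadPoolExecutor

# `spark` is the SparkSession that Databricks provides in the job context.

# Placeholder pipeline list; in reality each entry maps to a different source and sink.
PIPELINES = ["orders", "payments", "clicks"]

def run_pipeline(name: str) -> None:
    # Each pipeline is a long-running Structured Streaming query.
    query = (
        spark.readStream.table(f"bronze.{name}")              # illustrative source
        .writeStream
        .option("checkpointLocation", f"/chk/{name}")          # illustrative path
        .toTable(f"silver.{name}")                             # illustrative sink
    )
    # Blocks until the query stops or fails; a silent hang never returns here.
    query.awaitTermination()

# One worker thread per pipeline, all inside a single Databricks job.
with ThreadPoolExecutor(max_workers=len(PIPELINES)) as pool:
    futures = [pool.submit(run_pipeline, p) for p in PIPELINES]
    # result() re-raises worker exceptions, but a hung query raises nothing,
    # so this wait can sit for hours with no signal at all.
    for f in futures:
        f.result()
```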
Here’s the situation in more detail:
Inconsistent Behavior:
Most pipelines function as expected, which makes it difficult to pinpoint which pipeline has hung, and when, until the resulting data loss shows up.
Detection Challenges:
Because a hung pipeline doesn’t raise an exception and the rest keep running normally, our usual error-handling and retry mechanisms don’t catch the “frozen” state in time.
Idle Event Monitoring:
I’m particularly interested in using Databricks’ idle event management feature to proactively detect pipelines that have gone idle or are hanging.
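In the meantime, here is the kind of driver-side watchdog I’ve been prototyping. It assumes the pipelines are Structured Streaming queries and simply flags any query whose last progress event is older than a threshold; the 15-minute threshold and the alerting hook are placeholders:

```python
import time
from datetime import datetime, timezone

STALL_THRESHOLD_S = 15 * 60   # treat "no progress for 15 minutes" as a stall (arbitrary)
POLL_INTERVAL_S = 60

def is_stalled(query) -> bool:
    """True if the query has not reported a progress event recently."""
    progress = query.lastProgress            # dict of the most recent progress, or None
    if progress is None:
        return False                         # never progressed yet; handled separately
    last_ts = datetime.strptime(progress["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    age = (datetime.now(timezone.utc) - last_ts.replace(tzinfo=timezone.utc)).total_seconds()
    return age > STALL_THRESHOLD_S

while True:
    # spark.streams.active lists every streaming query started on this cluster.
    for q in spark.streams.active:
        if is_stalled(q):
            print(f"Pipeline {q.name or q.id} looks stalled")   # alerting hook goes here
    time.sleep(POLL_INTERVAL_S)
```

I’d much rather lean on something built into the platform than keep rolling my own polling loop, hence the questions below.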
Could anyone share insights or best practices on the following points?
Configuration and Activation:
How can I configure the idle event monitoring feature on Databricks specifically for real-time pipelines?
Are there particular settings or parameters that are crucial for accurately identifying a hanging pipeline?
Monitoring and Alerting:
What’s a good way to surface an alert (dashboard, email, webhook, etc.) as soon as a pipeline is flagged as idle, rather than discovering the gap hours later?
Thread Pool Considerations:
Could using a thread pool for managing multiple pipelines introduce certain risks that lead to these hanging issues?
Any best practices on resource allocation or per-pipeline timeouts when working with thread pools to mitigate such risks?
Integrating Auto-Restart Mechanisms:
Once a hung pipeline is detected, is there a clean way to stop and relaunch just that pipeline without restarting the whole job? A rough sketch of what I have in mind follows.
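Roughly, the restart I’m considering is to stop the stalled query and resubmit its pipeline function to the same pool; `run_pipeline` and `pool` below are the pieces from my launcher snippet above, and everything else is illustrative:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from pyspark.sql.streaming import StreamingQuery

def restart_pipeline(query: StreamingQuery, pool: ThreadPoolExecutor, name: str) -> Future:
    """Stop a stalled query and resubmit its pipeline to the thread pool."""
    try:
        # stop() signals the query to terminate, which should unblock the
        # awaitTermination() call in the worker thread that launched it.
        query.stop()
    except Exception as exc:
        # A truly wedged query might not even respond to stop(); log and move on.
        print(f"stop() failed for {name}: {exc}")
    # run_pipeline is the placeholder launcher function from the first snippet.
    return pool.submit(run_pipeline, name)
```

I’m not sure how reliable stop() is against a genuinely wedged query, which is part of why I’m asking.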
I’ve already tried relying on Databricks’ default error-handling mechanisms, but the intermittent hanging issue remains unresolved. I’m eager to hear your thoughts on leveraging idle event management effectively and any other strategies or configurations that have worked for you.
Thanks in advance for your insights!
Regards,
Hung Nguyen