Hi everyone,
I'm running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines run smoothly, I've noticed that a few of them occasionally get "stuck" and hang for several hours without raising any errors. This behavior results in data loss, which, as you can imagine, is a significant issue.
Here's the situation in more detail:
Inconsistent Behavior:
Most pipelines function as expected, so it's difficult to pinpoint which pipeline has hung, and when, until data loss occurs.
Detection Challenges:
Since the majority of pipelines keep running normally, our typical error-handling and retry mechanisms aren't catching the "frozen" state in time.
Idle Event Monitoring:
I'm particularly interested in using Databricks' idle event management feature to proactively detect pipelines that have gone idle or hung.
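For context, as a stopgap I've been sketching a simple watchdog that flags pipelines whose last recorded progress is older than a threshold. This assumes each pipeline thread updates a shared heartbeat timestamp; the names and the threshold below are illustrative, not real Databricks settings:

```python
import time

IDLE_THRESHOLD_SECONDS = 600  # assumption: 10 min without progress means "hung"

def find_idle_pipelines(last_progress, now=None):
    """Return names of pipelines whose last heartbeat is older than the threshold.

    `last_progress` maps pipeline name -> epoch timestamp of its last progress.
    """
    now = now if now is not None else time.time()
    return [
        name
        for name, ts in last_progress.items()
        if now - ts > IDLE_THRESHOLD_SECONDS
    ]

# Illustrative heartbeats recorded by each pipeline thread
heartbeats = {
    "orders_stream": time.time() - 30,    # progressed recently
    "clicks_stream": time.time() - 7200,  # no progress for 2 hours
}
print(find_idle_pipelines(heartbeats))  # flags "clicks_stream"
```

Is something along these lines how idle event management is meant to be used, or is there a built-in mechanism that replaces this?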
Could anyone share insights or best practices on the following points?
Configuration and Activation:
How can I configure the idle event monitoring feature on Databricks specifically for real-time pipelines?
Are there particular settings or parameters that are crucial for accurately identifying a hanging pipeline?
Monitoring and Alerting:
What's the recommended way to surface an alert (e.g., job notifications or webhooks) as soon as a pipeline goes idle, rather than hours later?
Thread Pool Considerations:
Could using a thread pool for managing multiple pipelines introduce certain risks that lead to these hanging issues?
Any best practices on resource allocation or per-pipeline timeouts when working with thread pools to mitigate such risks?
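On the timeout point, the pattern I've been experimenting with is wrapping each pipeline in a future and enforcing a per-pipeline deadline via `future.result(timeout=...)`. This is a minimal sketch assuming each pipeline run is a blocking callable; the names and timeout are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_all(jobs, timeout_seconds):
    """Run each named callable in a thread pool, flagging any that exceed the deadline.

    `jobs` maps pipeline name -> zero-argument callable (a stand-in for the
    actual pipeline run). Returns a dict of name -> result or "TIMED OUT".
    """
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn): name for name, fn in jobs.items()}
        for future, name in futures.items():
            try:
                results[name] = future.result(timeout=timeout_seconds)
            except TimeoutError:
                results[name] = "TIMED OUT"  # signal a hang; handle restart separately
    return results
```

One caveat I'm aware of: the timeout only signals the hang; the worker thread itself keeps running, so the restart step presumably still has to kill and resubmit the underlying job. Is that the right mental model?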
Integrating Auto-Restart Mechanisms:
Once a hung pipeline is detected, what's a reliable way to restart it automatically without disrupting the healthy pipelines?
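For reference, this is the kind of supervisor loop I'm considering. `is_stalled` and `restart_pipeline` are placeholders for the real idle check and whatever Jobs API call actually restarts the run; nothing here is a confirmed Databricks API:

```python
import time

def supervise(pipelines, is_stalled, restart_pipeline,
              poll_seconds=60, max_cycles=None):
    """Periodically poll each pipeline and restart any that appear stalled.

    `is_stalled(name)` and `restart_pipeline(name)` are caller-supplied
    placeholders for the real health check and restart call. `max_cycles`
    bounds the loop (None means run forever). Returns the names restarted.
    """
    restarted = []
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for name in pipelines:
            if is_stalled(name):
                restart_pipeline(name)
                restarted.append(name)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            time.sleep(poll_seconds)
    return restarted
```

I'd run this as a separate lightweight task so a hung pipeline thread can't take the supervisor down with it. Does this approach make sense, or is there a more idiomatic way on Databricks?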
I've already tried relying on Databricks' default error-handling mechanisms, but the intermittent hanging issue remains unresolved. I'm eager to hear your thoughts on leveraging idle event management effectively, and any other strategies or configurations that have worked for you.
Thanks in advance for your insights!
Regards,
Hung Nguyen