Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Handling Hanging Pipelines in Real-Time Environments: Leveraging Databricks’ Idle Event Monitoring

minhhung0507
Valued Contributor

Hi everyone,

I’m running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. While most pipelines are running smoothly, I’ve noticed that a few of them occasionally get “stuck” or hang for several hours without raising any errors. This behavior results in data loss, which, as you can imagine, is a significant issue.

Here’s the situation in more detail:

  • Inconsistent Behavior:
    Most pipelines function as expected, making it difficult to pinpoint when or which pipeline has hung until it causes data loss.

  • Detection Challenges:
    Since the majority of pipelines keep running normally, our typical error-handling and retry mechanisms aren’t catching the “frozen” state in time.

  • Idle Event Monitoring:
    I’m particularly interested in utilizing Databricks’ idle event management feature to proactively detect pipelines that have gone idle or are hanging.

Could anyone share insights or best practices on the following points?

  1. Configuration and Activation:

    • How can I configure the idle event monitoring feature on Databricks specifically for real-time pipelines?

    • Are there particular settings or parameters that are crucial for accurately identifying a hanging pipeline?

  2. Monitoring and Alerting:

    • What approaches do you use for real-time monitoring of each pipeline?

    • Do you have recommendations for setting up automatic alerts or dashboards that can catch idle or stuck pipelines immediately?

  3. Thread Pool Considerations:

    • Could using a thread pool for managing multiple pipelines introduce certain risks that lead to these hanging issues?

    • Any best practices on resource allocation or per-pipeline timeouts when working with thread pools to mitigate such risks?

  4. Integrating Auto-Restart Mechanisms:

    • Once a pipeline is detected to be hanging, what’s the best way to integrate an auto-restart or redirect mechanism to avoid data loss?
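
To make points 2 and 3 concrete, stall detection for pipelines sharing a thread pool could be sketched with a per-pipeline heartbeat. This is only an illustration under stated assumptions, not a Databricks feature: `HeartbeatRegistry`, the pipeline names, and the 120-second threshold are all hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class HeartbeatRegistry:
    """Thread-safe map of pipeline name -> time of its last reported progress.

    Hypothetical helper for illustration; not part of any Databricks API.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._last_beat = {}

    def beat(self, name, now=None):
        # A pipeline calls this each time it makes progress (e.g. per batch).
        with self._lock:
            self._last_beat[name] = time.monotonic() if now is None else now

    def stalled(self, max_idle_seconds, now=None):
        # Pipelines whose last heartbeat is older than the allowed idle time.
        now = time.monotonic() if now is None else now
        with self._lock:
            return sorted(
                name
                for name, ts in self._last_beat.items()
                if now - ts > max_idle_seconds
            )


registry = HeartbeatRegistry()

# Both (hypothetical) pipelines report progress at t=100 from pool threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(registry.beat, "orders", 100.0)
    pool.submit(registry.beat, "clicks", 100.0)

# Later, only "orders" keeps reporting; "clicks" has silently stopped.
registry.beat("orders", 400.0)

print(registry.stalled(max_idle_seconds=120, now=450.0))  # ['clicks']
```

A watchdog thread or a separate monitoring task could poll `stalled()` and raise an alert. Note that Python threads cannot be killed from outside, so a check like this only detects the hang; the restart itself has to be done some other way.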

I’ve already tried relying on Databricks’ default error-handling mechanisms, but the intermittent hanging issue remains unresolved. I’m eager to hear your thoughts on leveraging idle event management effectively and any other strategies or configurations that have worked for you.

Thanks in advance for your insights!

Regards,
Hung Nguyen
1 REPLY

-werners-
Esteemed Contributor III

May I ask why you use thread pools? With jobs you can define multiple tasks that do the same thing.
I'm asking because thread pools and Spark resource management can interfere with each other.
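
As a rough sketch of the multi-task alternative, a job definition with one independent task per pipeline could be built like this. The field names (`tasks`, `task_key`, `notebook_task`, `max_retries`, `min_retry_interval_millis`) follow the Jobs API 2.1 as I understand it and should be checked against the current documentation. The helper `build_job_payload`, the job name, pipeline names, and notebook directory are hypothetical, and cluster configuration is left out.

```python
def build_job_payload(job_name, pipeline_names, notebook_dir, max_retries=3):
    """Build a job definition with one task per pipeline.

    Hypothetical helper: each pipeline gets its own task, so a failure
    is retried per task instead of being hidden inside a shared thread pool.
    """
    tasks = []
    for name in pipeline_names:
        tasks.append(
            {
                "task_key": name,
                "notebook_task": {
                    # Assumed layout: one notebook per pipeline.
                    "notebook_path": f"{notebook_dir}/{name}",
                },
                "max_retries": max_retries,
                "min_retry_interval_millis": 60_000,  # wait 1 minute between retries
            }
        )
    return {"name": job_name, "tasks": tasks}


payload = build_job_payload(
    job_name="realtime-pipelines",           # hypothetical
    pipeline_names=["orders", "clicks"],     # hypothetical
    notebook_dir="/Workspace/pipelines",     # hypothetical
)
print([task["task_key"] for task in payload["tasks"]])  # ['orders', 'clicks']
```

With this layout each pipeline's status shows up as a separate task run, which makes it easier to see which one has stopped making progress.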