Hi everyone,
I'm running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. Most of the pipelines work fine, but a few of them occasionally get stuck for several hours, causing data loss. The challenge is that because most pipelines continue running normally, it's very hard to detect the stuck ones in time, and the usual error-handling and retry mechanisms don't catch this "frozen" state.
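For context, this is roughly how the job fans out the pipelines (a simplified sketch; `run_pipeline` and the pipeline names are placeholders for my actual streaming entry points):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_pipeline(name):
    # Placeholder: in the real job this starts a streaming query and
    # blocks until it terminates -- or silently hangs, which is the problem.
    return f"{name}: done"

pipelines = ["orders", "clicks", "payments"]  # illustrative names

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    futures = {pool.submit(run_pipeline, p): p for p in pipelines}
    results = [f.result() for f in as_completed(futures)]
```

Because each pipeline just blocks inside its worker thread, a hung pipeline never raises an exception, so the job as a whole still looks healthy.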
Does anyone have suggestions or best practices on how to detect, alert, and automatically retry (or recover) when one of these pipelines gets stuck? I'd really appreciate any guidance on setting up proper alerts and retries for these frozen pipelines.
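One direction I was wondering about is a heartbeat-style watchdog: each pipeline records a timestamp whenever it makes progress (e.g. from a streaming progress callback), and a separate check flags any pipeline whose last heartbeat is too old. A minimal sketch of that idea (names and the stall threshold are my assumptions, not anything Databricks-specific):

```python
import time

# Hypothetical heartbeat map: each pipeline calls beat() whenever it
# makes progress, e.g. after each processed micro-batch.
heartbeats = {}

STALL_SECONDS = 3600  # assume: treat 1h without progress as "stuck"

def beat(name):
    """Record that the named pipeline just made progress."""
    heartbeats[name] = time.monotonic()

def find_stuck(now=None):
    """Return the pipelines whose last heartbeat is older than the threshold."""
    now = time.monotonic() if now is None else now
    return [n for n, t in heartbeats.items() if now - t > STALL_SECONDS]
```

Would something like this be the right approach on Databricks, and if so, what is the cleanest way to then alert and restart just the stuck pipeline?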
Thanks in advance for your help!
Regards,
Hung Nguyen