How to set up an alert and retry policy for a specific pipeline?
02-25-2025 01:24 AM
Hi everyone,
I'm running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. Most of the pipelines work fine, but a few of them occasionally get stuck for several hours, causing data loss. The challenge is that because most pipelines continue running normally, it's very hard to detect the stuck ones in time, and the usual error-handling and retry mechanisms don't seem to catch this "frozen" state.
Does anyone have suggestions or best practices on how to detect, alert on, and automatically retry (or recover) when one of these pipelines gets stuck? I'd really appreciate any guidance on setting up proper alerts and retries for these frozen pipelines.
Thanks in advance for your help!
Hung Nguyen
3 weeks ago
How are you doing today? From what you describe, some of your pipelines hang without failing, which makes them hard to detect in time. A few practices that help:
- Set a timeout for each pipeline so that a run that takes too long is stopped and restarted automatically.
- Add a heartbeat check: each pipeline periodically writes a timestamp to a log or database table while it is running. If a pipeline stops updating its heartbeat, you know it is stuck and can trigger an alert or a restart.
- Use Databricks monitoring tools or cloud services such as AWS CloudWatch or Azure Monitor to track running jobs and send alerts when something takes longer than expected.
- Make sure your thread pool isn't overloaded; too many pipelines running at once may be starving some of them and causing the freezes.

These steps should help you catch and fix stuck pipelines before they cause data loss.
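The timeout-and-heartbeat idea above can be sketched in plain Python. This is a minimal illustration, not Databricks-specific code: `run_pipeline` is a hypothetical stand-in for your real streaming logic, and the names `heartbeats`, `run_with_timeout`, and `stale_pipelines` are made up for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Heartbeat table: pipeline name -> last time it reported progress.
# In production this would be a log, Delta table, or database row.
heartbeats = {}

def run_pipeline(name: str) -> str:
    """Hypothetical pipeline body; replace with your real logic."""
    for _ in range(3):                   # simulate a few micro-batches
        heartbeats[name] = time.time()   # heartbeat: record progress
        time.sleep(0.05)
    return f"{name}: done"

def run_with_timeout(name: str, timeout_s: float, max_retries: int = 2) -> str:
    """Run a pipeline in a worker thread; retry if it exceeds timeout_s.

    Note: a Python thread cannot be forcibly killed, so a truly stuck
    worker keeps running in the background. On Databricks you would
    instead cancel the underlying run (e.g. via the Jobs API).
    """
    for attempt in range(1, max_retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        future = pool.submit(run_pipeline, name)
        try:
            result = future.result(timeout=timeout_s)
            pool.shutdown(wait=False)
            return result
        except FutureTimeout:
            pool.shutdown(wait=False)  # don't block on the stuck worker
            print(f"{name}: attempt {attempt} timed out; retrying")
    raise RuntimeError(f"{name}: still stuck after {max_retries} attempts")

def stale_pipelines(max_age_s: float):
    """Watchdog check: names whose heartbeat is older than max_age_s."""
    now = time.time()
    return [n for n, ts in heartbeats.items() if now - ts > max_age_s]

print(run_with_timeout("pipe_a", timeout_s=5.0))
print("stale:", stale_pipelines(max_age_s=60.0))
```

A separate monitoring job (or a cloud alerting rule) can call something like `stale_pipelines` on a schedule and alert or restart anything that appears in the result.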
Regards,
Brahma
3 weeks ago
Hi @Brahmareddy ,
Thanks a lot for your solution. We are currently using Databricks with GCP. We will try it and see if it solves our problem.
Regards,
Hung Nguyen

