Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to set up an alert and retry policy for a specific pipeline?

minhhung0507
Contributor

Hi everyone,

I'm running multiple real-time pipelines on Databricks from a single job that submits them via a thread pool. Most of the pipelines work fine, but a few occasionally get stuck for several hours, causing data loss. The challenge is that because most pipelines keep running normally, it's very hard to detect the stuck ones in time, and the usual error-handling and retry mechanisms don't seem to catch this "frozen" state.
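
For reference, the submission pattern looks roughly like the sketch below (the pipeline names and the run_pipeline function are simplified placeholders, not our actual code):

```python
# Simplified sketch of the current setup: one Databricks job fans out all
# pipelines through a thread pool. Names here are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_pipeline(name: str) -> None:
    # Placeholder for the real-time logic of one pipeline.
    ...

pipelines = ["pipeline_a", "pipeline_b", "pipeline_c"]

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    futures = {pool.submit(run_pipeline, name): name for name in pipelines}
    for future in as_completed(futures):
        # A pipeline that raises an exception surfaces here, but one that
        # silently hangs never completes, so nothing is ever raised -- that
        # is the "frozen" state described above.
        future.result()
```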

Does anyone have suggestions or best practices for detecting, alerting on, and automatically retrying (or recovering) a pipeline when it gets stuck? I'd really appreciate any guidance on setting up proper alerts and retries for these frozen pipelines.

Thanks in advance for your help!

Regards,
Hung Nguyen
2 REPLIES

Brahmareddy
Honored Contributor II

Hi @minhhung0507 

How are you doing today? From what I understand, some of your pipelines are getting stuck without actually failing, which makes them hard to detect in time. A few things that should help:

- Set a timeout for each pipeline so that if it runs longer than expected, it gets restarted automatically.
- Add a heartbeat check: have each pipeline update a log or database with a timestamp while it's running. If a pipeline stops updating, you know it's stuck and can trigger an alert or restart it (see the sketch below).
- Use Databricks monitoring tools or cloud services such as AWS CloudWatch or Azure Monitor to track running jobs and send alerts when something takes longer than expected.
- Make sure your thread pool isn't overloaded; too many pipelines running at once may be causing some of them to freeze.

These steps should help you catch and fix stuck pipelines before they cause data loss.
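
If it helps, here is a minimal sketch of the heartbeat idea, assuming a Delta table that each pipeline writes to (the table name, columns, and staleness threshold below are hypothetical and would need to be adapted to your setup):

```python
# Minimal heartbeat sketch (assumes a Databricks notebook/job where `spark`
# is available). Table name, columns, and threshold are hypothetical.
HEARTBEAT_TABLE = "monitoring.pipeline_heartbeats"  # columns: pipeline STRING, last_seen TIMESTAMP

def record_heartbeat(pipeline_name: str) -> None:
    # Call this periodically from inside each running pipeline
    # (for example, once per processed micro-batch).
    spark.sql(f"""
        MERGE INTO {HEARTBEAT_TABLE} AS t
        USING (SELECT '{pipeline_name}' AS pipeline, current_timestamp() AS last_seen) AS s
        ON t.pipeline = s.pipeline
        WHEN MATCHED THEN UPDATE SET t.last_seen = s.last_seen
        WHEN NOT MATCHED THEN INSERT (pipeline, last_seen) VALUES (s.pipeline, s.last_seen)
    """)

def find_stale_pipelines(stale_minutes: int = 15) -> list:
    # Run this from a separate scheduled watchdog job; any pipeline returned
    # here has stopped heartbeating and is a candidate for an alert or restart.
    rows = spark.sql(f"""
        SELECT pipeline FROM {HEARTBEAT_TABLE}
        WHERE last_seen < current_timestamp() - INTERVAL {stale_minutes} MINUTES
    """).collect()
    return [r.pipeline for r in rows]
```

You could then schedule a small watchdog job that calls find_stale_pipelines on a regular interval and sends a notification (or restarts the affected run) whenever it returns anything.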

Regards,

Brahma

minhhung0507
Contributor

Hi @Brahmareddy ,

Thanks a lot for your suggestions. We're currently running Databricks on GCP, so we'll try this approach and see if it solves our problem.

Regards,
Hung Nguyen
