Hi everyone,
I'm running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. Most of the pipelines work fine, but a few occasionally get stuck for several hours, causing data loss. The challenge is that because most pipelines continue running normally, it's very hard to detect the stuck ones in time, and the usual error-handling and retry mechanisms don't catch this "frozen" state: the thread neither fails nor makes progress.
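For context, the submission logic looks roughly like this (the pipeline names and the `run_pipeline` function below are simplified placeholders, not my actual code; in the real job each call starts a streaming query and blocks on it):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_pipeline(name):
    # Placeholder: the real function starts a streaming pipeline and
    # blocks until it terminates. A hung pipeline never returns from here.
    return f"{name} finished"

pipelines = ["orders", "payments", "clicks"]  # hypothetical names

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    futures = {pool.submit(run_pipeline, name): name for name in pipelines}
    for fut in as_completed(futures):
        print(futures[fut], "->", fut.result())
```

The problem is that `as_completed` only tells me about threads that finish or raise; a thread that is alive but stuck looks identical to a healthy one.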
Does anyone have suggestions or best practices for detecting, alerting on, and automatically retrying (or recovering) a pipeline that gets stuck? I'd really appreciate any guidance on setting up proper alerts and retries for these frozen pipelines.
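One idea I was considering, sketched below purely for discussion (the names `record_heartbeat` and `find_stuck` are hypothetical, not an existing API): each pipeline records a heartbeat timestamp whenever it makes progress, and a separate watchdog periodically flags any pipeline whose heartbeat is older than a threshold, so it can be alerted on and restarted. Is this a reasonable pattern, or is there a more idiomatic Databricks mechanism?

```python
import threading
import time

heartbeats = {}            # pipeline name -> time of last observed progress
lock = threading.Lock()

def record_heartbeat(name):
    # Each pipeline would call this whenever it processes data
    # (e.g. from a progress callback in the streaming framework).
    with lock:
        heartbeats[name] = time.monotonic()

def find_stuck(max_silence_s):
    # Returns pipelines silent for longer than the threshold; a real
    # watchdog would run this in a loop, alert, and trigger restarts.
    now = time.monotonic()
    with lock:
        return [n for n, t in heartbeats.items() if now - t > max_silence_s]

record_heartbeat("orders")
time.sleep(0.2)
record_heartbeat("payments")
print(find_stuck(0.1))  # "orders" has been silent longer than 0.1 s
```

My worry is where to run the watchdog so that it survives when the driver itself is unhealthy.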
Thanks in advance for your help!
Regards,
Hung Nguyen