Hi everyone,
I'm running multiple real-time pipelines on Databricks using a single job that submits them via a thread pool. Most of the pipelines work fine, but a few of them occasionally get stuck for several hours, causing data loss. The challenge is that because most pipelines continue running normally, it's very hard to detect the stuck ones in time, and the usual error-handling and retry mechanisms don't catch this "frozen" state.
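For context, this is roughly how the job fans out the pipelines (a simplified sketch; `run_pipeline` and the pipeline names are placeholders for my actual streaming entry points):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_pipeline(name):
    # Placeholder: in the real job this starts a streaming query and
    # blocks until it terminates -- or silently hangs, which is the problem.
    return f"{name}: done"

pipelines = ["orders", "clicks", "payments"]  # illustrative names

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    futures = {pool.submit(run_pipeline, p): p for p in pipelines}
    results = [f.result() for f in as_completed(futures)]
```

Because each pipeline just blocks inside its worker thread, a hung pipeline never raises an exception, so the job as a whole still looks healthy.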
Does anyone have suggestions or best practices on how to detect, alert, and automatically retry (or recover) when one of these pipelines gets stuck? I'd really appreciate any guidance on setting up proper alerts and retries for these frozen pipelines.
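One direction I was wondering about is a heartbeat-style watchdog: each pipeline records a timestamp whenever it makes progress (e.g. from a streaming progress callback), and a separate check flags any pipeline whose last heartbeat is too old. A minimal sketch of that idea (names and the stall threshold are my assumptions, not anything Databricks-specific):

```python
import time

# Hypothetical heartbeat map: each pipeline calls beat() whenever it
# makes progress, e.g. after each processed micro-batch.
heartbeats = {}

STALL_SECONDS = 3600  # assume: treat 1h without progress as "stuck"

def beat(name):
    """Record that the named pipeline just made progress."""
    heartbeats[name] = time.monotonic()

def find_stuck(now=None):
    """Return the pipelines whose last heartbeat is older than the threshold."""
    now = time.monotonic() if now is None else now
    return [n for n, t in heartbeats.items() if now - t > STALL_SECONDS]
```

Would something like this be the right approach on Databricks, and if so, what is the cleanest way to then alert and restart just the stuck pipeline?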
Thanks in advance for your help!
Regards,
Hung Nguyen