Data Engineering

How to manage a dynamically scheduled job if an INTERNAL_ERROR occurs?

jeremy98
Honored Contributor

Hi community,

My team and I have been occasionally experiencing INTERNAL_ERROR events in Databricks. We have a job that runs on a schedule, but the start times vary. Sometimes, when the job is triggered, the underlying cluster fails to start for some reason.

I'd like some advice on how to better investigate these issues and how to set up a mitigation or fallback mechanism. Specifically, I want a way to detect when the job starts but the cluster cannot initialize, and then run an alternative process or alert.

Any suggestions or best practices would be greatly appreciated!

1 REPLY

SP_6721
Honored Contributor

Hi @jeremy98 ,

To investigate, check the Jobs UI for failed runs and review both error messages and cluster logs. Monitor failure trends over time and adjust cluster settings or quotas if needed.
https://docs.databricks.com/gcp/en/jobs/repair-job-failures
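The same check can also be scripted rather than done by hand in the UI. Below is a minimal sketch, assuming the Jobs REST API 2.1 `runs/list` endpoint and its `state.life_cycle_state` field. The environment variable names `DATABRICKS_HOST` and `DATABRICKS_TOKEN` and both function names are conventions of this sketch, not something from the thread.

```python
import json
import os
import urllib.parse
import urllib.request


def fetch_completed_runs(job_id, limit=25):
    """Call GET /api/2.1/jobs/runs/list for one job and return the parsed JSON.

    Assumption: DATABRICKS_HOST (the workspace URL) and DATABRICKS_TOKEN
    (a personal access token) are set in the environment.
    """
    host = os.environ["DATABRICKS_HOST"].rstrip("/")
    token = os.environ["DATABRICKS_TOKEN"]
    query = urllib.parse.urlencode(
        {"job_id": job_id, "completed_only": "true", "limit": limit}
    )
    request = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/list?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def internal_error_runs(runs_payload):
    """Return (run_id, state_message) for runs that ended in INTERNAL_ERROR."""
    hits = []
    for run in runs_payload.get("runs", []):
        state = run.get("state", {})
        if state.get("life_cycle_state") == "INTERNAL_ERROR":
            hits.append((run.get("run_id"), state.get("state_message", "")))
    return hits
```

Calling `internal_error_runs(fetch_completed_runs(job_id))` on a schedule and logging the result is one way to track how often the error occurs. The `state_message` text is usually the first place to look for the cluster start-up reason.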

For detection, enable job notifications for “on failure” events in Job settings.
https://docs.databricks.com/gcp/en/jobs/notifications
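If the job is managed through the API rather than the UI, the equivalent setting is the `email_notifications.on_failure` list in the job settings. Here is a rough sketch of the request body for `POST /api/2.1/jobs/update`. The helper name and the example address are made up for illustration.

```python
def failure_notification_update(job_id, emails):
    """Build a jobs/update request body that emails `emails` when a run fails.

    Note: jobs/update replaces each top-level field given in new_settings,
    so any existing email_notifications entries (on_start, on_success, ...)
    would need to be included here as well to be preserved.
    """
    return {
        "job_id": job_id,
        "new_settings": {
            "email_notifications": {"on_failure": list(emails)},
        },
    }


# Hypothetical job id and address, for illustration only.
body = failure_notification_update(123, ["data-team@example.com"])
```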

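For fallback, add a downstream task to the job that depends on the main task, and set its "Run if dependencies" condition to "At least one failed" (or "All failed") so that it runs only when the upstream task fails.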
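In job settings JSON, such a fallback task would look roughly like the sketch below, which assumes the Jobs API task fields `depends_on` and `run_if`. The task keys and notebook path are hypothetical. Compute for the fallback task is deliberately left out. If the original failure is the cluster not starting, the fallback probably should not share that cluster, and it is worth testing that a cluster start failure actually triggers this condition in your workspace.

```python
def fallback_task(main_task_key, fallback_task_key, fallback_notebook_path):
    """Build a task definition that runs only if the upstream task failed."""
    return {
        "task_key": fallback_task_key,
        # Run after the main task has finished...
        "depends_on": [{"task_key": main_task_key}],
        # ...but only when at least one dependency failed.
        "run_if": "AT_LEAST_ONE_FAILED",
        "notebook_task": {"notebook_path": fallback_notebook_path},
        # Compute settings are omitted on purpose (see the note above).
    }


# Hypothetical names, for illustration only.
task = fallback_task("main_etl", "fallback_alert", "/Shared/fallback_notebook")
```

The resulting dict would be appended to the job's `tasks` list alongside the main task.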
