Learnings while troubleshooting Random Databricks Job Failures After Cloud Migration

tushar_sable — Mon, 08 Jun 2026 07:34:28 GMT

One of my learnings from a project.
After migrating from an on-premises environment to the cloud, the data engineering team began noticing seemingly random failures in workflow-scheduled Databricks jobs.

The failures were intermittent, and the jobs often succeeded when they were repaired and rerun, which made root cause analysis particularly challenging.

## The Error

The jobs were failing with the following message:

```
Reason: AZURE_QUOTA_EXCEEDED_EXCEPTION (CLIENT_ERROR)
Parameters:
databricks_error_message:
The VM size you are specifying is not available.
```

At first glance, the error pointed toward Azure infrastructure capacity or quota limitations. However, the random nature of the failures and the fact that repair runs frequently succeeded suggested there might be more to the story.

## Investigating the Root Cause

A review of the affected notebooks revealed a common pattern in the initialization code:

```python
%pip install package1
%pip install package2
%pip install package3
dbutils.library.restartPython()
```

The notebooks installed their dependencies with multiple `%pip install` commands and then explicitly called `dbutils.library.restartPython()`.

### Why This Causes Problems

In an interactive notebook environment, `dbutils.library.restartPython()` is often used alongside `%pip install` to reload the Python environment after package installation.

However, Databricks Jobs behave differently.

When executed as part of a scheduled job, `dbutils.library.restartPython()` forcibly terminates the running Python process. The job runner interprets this abrupt termination as an unexpected shutdown and marks the execution as **Cancelled**, resulting in job failure.

## A Secondary Stability Issue

Another contributing factor was the use of multiple `%pip install` cells.

Each `%pip install` command can trigger its own environment restart. When several installation commands are executed separately, the notebook may experience multiple restarts during initialization, increasing execution overhead and introducing instability into scheduled runs.
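To find affected notebooks, a quick scan of the notebook source for the two patterns described so far can help. The following is only a minimal sketch: `audit_notebook_source` and its report keys are hypothetical names made up for illustration, and it assumes the notebook source is available as plain text (lines may carry the `# MAGIC ` prefix used when a notebook is exported as a `.py` source file).

```python
import re

# Matches a %pip install line, with or without the "# MAGIC " prefix
# found in notebooks exported as .py source files.
PIP_INSTALL = re.compile(r"^\s*(?:#\s*MAGIC\s+)?%pip\s+install\b")
# Matches an active (not commented-out) restartPython() call.
RESTART_CALL = re.compile(r"^\s*dbutils\.library\.restartPython\(\)")


def audit_notebook_source(source: str) -> dict:
    """Report where the patterns occur (hypothetical helper, not a Databricks API)."""
    lines = source.splitlines()
    pip_lines = [n for n, line in enumerate(lines, start=1) if PIP_INSTALL.match(line)]
    restart_lines = [n for n, line in enumerate(lines, start=1) if RESTART_CALL.match(line)]
    return {
        "pip_install_lines": pip_lines,
        "restart_python_lines": restart_lines,
        # Flag the notebook if it has several install lines or any restart call.
        "needs_review": len(pip_lines) > 1 or bool(restart_lines),
    }


sample = """%pip install package1
%pip install package2
%pip install package3
dbutils.library.restartPython()
"""
report = audit_notebook_source(sample)
print(report)
```

For the sample above, the report lists three install lines and one restart call, so the notebook is flagged for review.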

## The Fix

The solution was straightforward:

1. Remove all `dbutils.library.restartPython()` calls from notebooks running as Databricks Jobs.
2. Consolidate package installations into a single `%pip install` statement whenever possible.

Instead of:

```python
%pip install package1
%pip install package2
%pip install package3
dbutils.library.restartPython()
```

Use:

```python
%pip install package1 package2 package3
```
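If many notebooks need the same change, the rewrite can be scripted. This is a rough sketch rather than a finished tool: `consolidate_installs` is a hypothetical helper, and it assumes the install lines contain only plain package names or version specifiers (no per-line options that must stay separate) and that all installs belong at the top of the notebook.

```python
import re

PIP_INSTALL = re.compile(r"^\s*%pip\s+install\s+(.*)$")
RESTART_CALL = re.compile(r"^\s*dbutils\.library\.restartPython\(\)\s*$")


def consolidate_installs(cell_lines):
    """Merge %pip install lines into one and drop restartPython() calls.

    Hypothetical helper: assumes plain package specs and that the merged
    install line should come first in the notebook.
    """
    packages = []
    kept = []
    for line in cell_lines:
        match = PIP_INSTALL.match(line)
        if match:
            # Collect every argument from this install line, in order.
            packages.extend(match.group(1).split())
        elif RESTART_CALL.match(line):
            # Drop the explicit restart, as recommended above.
            continue
        else:
            kept.append(line)
    merged = ["%pip install " + " ".join(packages)] if packages else []
    return merged + kept


before = [
    "%pip install package1",
    "%pip install package2",
    "%pip install package3",
    "dbutils.library.restartPython()",
    "import package1",
]
after = consolidate_installs(before)
print("\n".join(after))
```

Review the result by hand before committing it, since the sketch does not understand install options or installs that are deliberately placed later in a notebook.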

## Key Takeaway

Not every failure message points directly to the root cause. Although the jobs reported an Azure quota-related exception, the underlying issue was notebook initialization logic that caused unexpected Python process termination.

When running Databricks notebooks as scheduled jobs:

* Avoid using `dbutils.library.restartPython()`.
* Consolidate package installations into a single `%pip install` command.
* Minimize unnecessary environment restarts during job startup.

These small changes can significantly improve job stability and eliminate difficult-to-diagnose intermittent failures.
