<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Installing libraries on job clusters using tasks dependencies is not reliable in case of repairs in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/146166#M4775</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/212799"&gt;@aliz&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;My suggestion would be not to rely on this.&amp;nbsp;When you repair a run, Databricks creates a new job cluster and only re-runs the selected subset of tasks, so your setup task is skipped and the fresh cluster is missing dependencies.&lt;/P&gt;
&lt;P class="p1"&gt;Some quick common solutions that I can think of right now are as follows:&lt;/P&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Use task-scoped Dependent libraries for the tasks that need them. These libraries are installed before the task begins and are then available to subsequent tasks on the same job cluster during the run. &lt;BR /&gt;Databricks does not allow declaring libraries in a shared job-cluster spec; they must be attached to tasks. To reduce install time, pull Python wheels or a requirements.txt from Unity Catalog Volumes or Workspace files (local fast read, no external network), not from remote package indexes.&lt;/LI&gt;
&lt;LI class="li1"&gt;Prefer Serverless Jobs with Environments for Python dependencies. Declare your requirements once (environment panel or base-environment YAML), and Databricks caches the virtual environment across tasks in the run, so subsequent tasks don’t reinstall the same packages. This materially speeds up multi-task jobs and avoids your repair-run gap.&lt;/LI&gt;
&lt;LI class="li1"&gt;When repairing a run, re-run the setup/dependency task as part of the repair:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;In the UI “Repair run” pane, include the successful setup task so its libraries are reinstalled on the new cluster.&lt;/LI&gt;
&lt;LI class="li1"&gt;Via CLI/API, use jobs repair-run with --rerun-tasks &amp;lt;setup_task_key&amp;gt; (and optionally --rerun-dependent-tasks) to explicitly re-run that task even if it previously succeeded.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;If you must keep “install at cluster start,” use init scripts to install from UC Volumes/Workspace files (fast, controlled sources). Init scripts run on cluster creation; because repair runs create a new job cluster, the libraries will be installed again automatically. Note Databricks generally recommends using task libraries or environments over init scripts; weigh operational trade-offs.&lt;/LI&gt;
&lt;LI class="li1"&gt;As an alternative architecture, for long-lived pipelines with stable dependencies, consider running on an all‑purpose shared cluster with compute-scoped libraries installed once at cluster level, so all job tasks share the same environment and repair runs don’t rely on a setup task.&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Fri, 30 Jan 2026 17:54:52 GMT</pubDate>
    <dc:creator>iyashk-DB</dc:creator>
    <dc:date>2026-01-30T17:54:52Z</dc:date>
    <item>
      <title>Installing libraries on job clusters using tasks dependencies is not reliable in case of repairs</title>
      <link>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/145524#M4770</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P data-unlink="true"&gt;Following the suggestion on &lt;A href="https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters/m-p/37365" target="_blank"&gt;this thread&lt;/A&gt;, for job clusters we install the libraries only on the first task of the workflow, which are then made available to the subsequent tasks.&lt;BR /&gt;However, this method is not reliable in the case of run repairs: the state of the cluster is not recovered, and therefore libraries are not installed because only the tasks following the failure are executed. This means the task containing the dependencies is not re-executed.&lt;/P&gt;&lt;P data-unlink="true"&gt;Unfortunately, attaching the dependencies to each and every task of the workflow is not an option, since the libraries seem to be reinstalled every time, leading to an increase in workflow execution time proportional to the number of tasks.&lt;/P&gt;&lt;P data-unlink="true"&gt;Are there any common solution to this problem?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Jan 2026 11:54:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/145524#M4770</guid>
      <dc:creator>aliz</dc:creator>
      <dc:date>2026-01-28T11:54:55Z</dc:date>
    </item>
    <item>
      <title>Re: Installing libraries on job clusters using tasks dependencies is not reliable in case of repairs</title>
      <link>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/146166#M4775</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/212799"&gt;@aliz&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;My suggestion would be not to rely on this.&amp;nbsp;When you repair a run, Databricks creates a new job cluster and only re-runs the selected subset of tasks, so your setup task is skipped and the fresh cluster is missing dependencies.&lt;/P&gt;
&lt;P class="p1"&gt;Some quick common solutions that I can think of right now are as follows:&lt;/P&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;Use task-scoped Dependent libraries for the tasks that need them. These libraries are installed before the task begins and are then available to subsequent tasks on the same job cluster during the run. &lt;BR /&gt;Databricks does not allow declaring libraries in a shared job-cluster spec; they must be attached to tasks. To reduce install time, pull Python wheels or a requirements.txt from Unity Catalog Volumes or Workspace files (local fast read, no external network), not from remote package indexes.&lt;/LI&gt;
&lt;LI class="li1"&gt;Prefer Serverless Jobs with Environments for Python dependencies. Declare your requirements once (environment panel or base-environment YAML), and Databricks caches the virtual environment across tasks in the run, so subsequent tasks don’t reinstall the same packages. This materially speeds up multi-task jobs and avoids your repair-run gap.&lt;/LI&gt;
&lt;LI class="li1"&gt;When repairing a run, re-run the setup/dependency task as part of the repair:&lt;/LI&gt;
&lt;UL class="ul1"&gt;
&lt;LI class="li1"&gt;In the UI “Repair run” pane, include the successful setup task so its libraries are reinstalled on the new cluster.&lt;/LI&gt;
&lt;LI class="li1"&gt;Via CLI/API, use jobs repair-run with --rerun-tasks &amp;lt;setup_task_key&amp;gt; (and optionally --rerun-dependent-tasks) to explicitly re-run that task even if it previously succeeded.&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI class="li1"&gt;If you must keep “install at cluster start,” use init scripts to install from UC Volumes/Workspace files (fast, controlled sources). Init scripts run on cluster creation; because repair runs create a new job cluster, the libraries will be installed again automatically. Note Databricks generally recommends using task libraries or environments over init scripts; weigh operational trade-offs.&lt;/LI&gt;
&lt;LI class="li1"&gt;As an alternative architecture, for long-lived pipelines with stable dependencies, consider running on an all‑purpose shared cluster with compute-scoped libraries installed once at cluster level, so all job tasks share the same environment and repair runs don’t rely on a setup task.&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Fri, 30 Jan 2026 17:54:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/146166#M4775</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2026-01-30T17:54:52Z</dc:date>
    </item>
    <item>
      <title>Re: Installing libraries on job clusters using tasks dependencies is not reliable in case of repairs</title>
      <link>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/147680#M4821</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/212799"&gt;@aliz&lt;/a&gt;&amp;nbsp;Either go serverless, as this will reduce cluster startup time, or keep controlling your compute via job clusters but specify the libraries for all tasks.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Feb 2026 11:45:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/147680#M4821</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2026-02-09T11:45:30Z</dc:date>
    </item>
    <item>
      <title>Re: Installing libraries on job clusters using tasks dependencies is not reliable in case of repairs</title>
      <link>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/150167#M4974</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/212799"&gt;@aliz&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;This is a common pattern to run into when using a dedicated "setup" task for library installation on shared job clusters. The core issue is that repair runs provision a fresh job cluster and only re-execute the failed (or selected) tasks, so your setup task is skipped and the new cluster starts without the required libraries.&lt;/P&gt;
&lt;P&gt;Here are several approaches to solve this reliably, ordered from most recommended to least:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;OPTION 1: USE TASK-LEVEL DEPENDENT LIBRARIES (RECOMMENDED)&lt;/P&gt;
&lt;P&gt;Rather than relying on a single setup task, attach the required libraries directly to each task that needs them. You can do this in the task configuration under "Dependent libraries" in the Jobs UI, or via the "libraries" field in the Jobs API/CLI.&lt;/P&gt;
&lt;P&gt;A concern you raised is that re-installing on every task is slow. To mitigate this:&lt;/P&gt;
&lt;P&gt;- Host your Python wheels or requirements.txt in Unity Catalog Volumes or Workspace Files. These are local reads with no external network fetch, making installation significantly faster than pulling from PyPI or other remote indexes.&lt;BR /&gt;- Pre-build a single wheel that bundles all your dependencies together. This reduces the number of install operations per task to one.&lt;BR /&gt;- If you have many tasks sharing the same job cluster, libraries installed by the first task that runs will already be present for subsequent tasks in the same run (they share the same cluster). You only need the dependent libraries declared on each task as a safety net for repair scenarios.&lt;/P&gt;
&lt;P&gt;Documentation: &lt;A href="https://docs.databricks.com/aws/en/libraries" target="_blank"&gt;https://docs.databricks.com/aws/en/libraries&lt;/A&gt;&lt;/P&gt;
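&lt;P&gt;As a rough sketch of this option (the task keys, cluster key, and wheel path below are placeholders, not taken from this thread), the task list of a Jobs API 2.1 create/reset payload can attach the same Volume-hosted wheel to every task, so any subset re-run by a repair reinstalls its own dependencies:&lt;/P&gt;

```python
import json

# Hypothetical Jobs API 2.1 task list: each task declares the same wheel
# from a Unity Catalog Volume, so any subset of tasks re-run by a repair
# still installs its dependencies on the fresh job cluster.
wheel = {"whl": "/Volumes/main/default/libs/my_deps-1.0-py3-none-any.whl"}

tasks = [
    {
        "task_key": task_key,
        "job_cluster_key": "shared_cluster",
        "notebook_task": {"notebook_path": f"/Workspace/jobs/{task_key}"},
        "libraries": [wheel],  # task-scoped dependent library
    }
    for task_key in ("ingest", "transform", "publish")
]

print(json.dumps({"tasks": tasks}, indent=2))
```

&lt;P&gt;Because the wheel is read from a Unity Catalog Volume, each per-task install is a local copy rather than a remote package fetch.&lt;/P&gt;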
&lt;P&gt;&lt;BR /&gt;OPTION 2: USE SERVERLESS COMPUTE WITH ENVIRONMENTS&lt;/P&gt;
&lt;P&gt;If your workloads support serverless compute, this is the cleanest approach. With serverless jobs, you define an Environment that specifies your Python dependencies (via PyPI packages or a requirements.txt). Databricks caches the resolved virtual environment, so subsequent tasks in the same run do not reinstall packages. This caching persists across repair runs as well, making dependency management seamless.&lt;/P&gt;
&lt;P&gt;To set this up:&lt;BR /&gt;1. In the job configuration, select Serverless as the compute type.&lt;BR /&gt;2. Define an Environment with your required packages.&lt;BR /&gt;3. All tasks using that environment share the cached dependencies.&lt;/P&gt;
&lt;P&gt;Documentation: &lt;A href="https://docs.databricks.com/aws/en/jobs" target="_blank"&gt;https://docs.databricks.com/aws/en/jobs&lt;/A&gt;&lt;/P&gt;
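&lt;P&gt;A hypothetical serverless job payload (package versions, paths, and keys below are illustrative, not from this thread) declaring one shared environment that every task references by key:&lt;/P&gt;

```python
import json

# Hypothetical serverless job payload: a single environment declares the
# Python dependencies once, and each task points at it via environment_key,
# so multi-task runs and repairs reuse the cached environment.
job = {
    "environments": [
        {
            "environment_key": "default_env",
            "spec": {
                "client": "1",  # serverless environment version (assumed)
                "dependencies": ["pandas==2.2.2", "requests"],
            },
        }
    ],
    "tasks": [
        {
            "task_key": "transform",
            "environment_key": "default_env",
            "notebook_task": {"notebook_path": "/Workspace/jobs/transform"},
        }
    ],
}

print(json.dumps(job, indent=2))
```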
&lt;P&gt;&lt;BR /&gt;OPTION 3: USE INIT SCRIPTS ON THE JOB CLUSTER&lt;/P&gt;
&lt;P&gt;If you need to stay on classic job clusters and cannot attach libraries to every task, you can use cluster-scoped init scripts. Init scripts are shell scripts that execute automatically during cluster startup, before any tasks run. Since a repair run creates a new job cluster, the init script will run again on that new cluster, ensuring libraries are installed.&lt;/P&gt;
&lt;P&gt;Example init script (save to a Unity Catalog Volume):&lt;/P&gt;
&lt;P&gt;#!/bin/bash&lt;BR /&gt;/databricks/python/bin/pip install your-package-1 your-package-2&lt;/P&gt;
&lt;P&gt;Then reference the script in your job cluster configuration under "Init Scripts."&lt;/P&gt;
&lt;P&gt;Important considerations:&lt;BR /&gt;- Databricks recommends using task libraries or environments over init scripts when possible.&lt;BR /&gt;- Store init scripts in Unity Catalog Volumes or Workspace Files (DBFS-based init scripts are deprecated in Runtime 15.1+).&lt;BR /&gt;- Init scripts add to cluster startup time, but they guarantee libraries are present regardless of which tasks are executed.&lt;/P&gt;
&lt;P&gt;Documentation: &lt;A href="https://docs.databricks.com/aws/en/init-scripts" target="_blank"&gt;https://docs.databricks.com/aws/en/init-scripts&lt;/A&gt;&lt;/P&gt;
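&lt;P&gt;A sketch of a job-cluster spec wiring in such an init script (the runtime version, node type, and script path are placeholders): because the script is part of the cluster definition, the fresh cluster that a repair run provisions runs it again automatically.&lt;/P&gt;

```python
import json

# Hypothetical job-cluster spec: a cluster-scoped init script stored in a
# Unity Catalog Volume runs on every cluster creation, including the fresh
# cluster that a repair run provisions, so libraries are always present.
job_cluster = {
    "job_cluster_key": "shared_cluster",
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime
        "node_type_id": "i3.xlarge",          # placeholder node type
        "num_workers": 2,
        "init_scripts": [
            {"volumes": {"destination": "/Volumes/main/default/scripts/install_libs.sh"}}
        ],
    },
}

print(json.dumps(job_cluster, indent=2))
```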
&lt;P&gt;&lt;BR /&gt;OPTION 4: INCLUDE THE SETUP TASK IN REPAIR RUNS&lt;/P&gt;
&lt;P&gt;As a workaround with your current architecture, when you trigger a repair run, you can explicitly include the setup task even if it previously succeeded:&lt;/P&gt;
&lt;P&gt;- In the UI: In the "Repair run" dialog, check the box next to your setup/library installation task so it re-executes on the new cluster.&lt;BR /&gt;- Via CLI: use jobs repair-run with --rerun-tasks to list your setup task key (and --rerun-dependent-tasks if applicable). Via the API, include the setup task key in the rerun_tasks field of the jobs/runs/repair request.&lt;/P&gt;
&lt;P&gt;This is the least automated option since it requires manual intervention (or custom API logic) on each repair, but it works without changing your job architecture.&lt;/P&gt;
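&lt;P&gt;A minimal Python sketch of that repair request, assuming a placeholder workspace URL, token, and run ID; only the request object is built here, nothing is sent:&lt;/P&gt;

```python
import json
from urllib.request import Request  # request object only; nothing is sent

host = "https://example.cloud.databricks.com"  # placeholder workspace URL
token = "REDACTED"                             # placeholder access token

# Listing the setup task in rerun_tasks forces it to re-execute even though
# it previously succeeded, so the new job cluster gets its libraries back.
payload = {
    "run_id": 123456,                            # run to repair (placeholder)
    "rerun_tasks": ["setup_libs", "transform"],  # include the setup task
    "rerun_dependent_tasks": False,
}

req = Request(
    url=host + "/api/2.1/jobs/runs/repair",
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer " + token, "Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)
```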
&lt;P&gt;&lt;BR /&gt;SUMMARY&lt;/P&gt;
&lt;P&gt;For long-term reliability, Option 1 (task-level dependent libraries with packages hosted in UC Volumes) or Option 2 (serverless with environments) will give you the most robust behavior during both normal runs and repairs. Init scripts (Option 3) are a solid fallback if you need libraries at the cluster level.&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches answers from the wide set of documentation I have available and from previous memory. I personally review each draft for obvious issues, monitor the system for reliability, and update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;</description>
      <pubDate>Sun, 08 Mar 2026 07:21:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/installing-libraries-on-job-clusters-using-tasks-dependencies-is/m-p/150167#M4974</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-08T07:21:40Z</dc:date>
    </item>
  </channel>
</rss>

