<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Using pip cache for pypi compute libraries in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</link>
    <description>Forum thread: achieving a persistent pip cache for compute-wide PyPI libraries on Databricks clusters.</description>
    <pubDate>Thu, 07 Aug 2025 15:47:24 GMT</pubDate>
    <dc:creator>spoltier</dc:creator>
    <dc:date>2025-08-07T15:47:24Z</dc:date>
    <item>
      <title>Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</link>
      <description>&lt;P&gt;I am able to configure pip's behavior w.r.t. the index URL by setting PIP_INDEX_URL, PIP_TRUSTED_HOST, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would like to cache compute-wide PyPI libraries to improve cluster startup performance and reliability.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, I notice that PIP_CACHE_DIR has no effect. I also noticed that the library process cannot write to a volume, which is where I would like to store the cache.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I verified this by setting PIP_REPORT=/Volumes/path/to/my/volume, which triggered a permission-denied error; here is an example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Library installation attempted on the driver node of cluster 0519-144033-sht8s7z9 and failed due to an infrastructure issue caused by an invalid access token. Please contact Databricks support. Error code: FAULT_ACCESS_TOKEN_NOT_PROVIDED, error message: org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install '&amp;lt;internal package name&amp;gt;' --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 1] Operation not permitted: '/Volumes/&amp;lt;internal infra details&amp;gt;/pip/report.json'&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've tried passing PIP_NO_CACHE_DIR=0 to make pip use the cache (in case it is disabled by default), but this had no effect.&lt;/P&gt;&lt;P&gt;It does seem, as indicated in &lt;A href="https://community.databricks.com/t5/data-engineering/job-sometimes-failing-due-to-library-installation-error-of-pypi/td-p/113704" target="_blank" rel="noopener"&gt;https://community.databricks.com/t5/data-engineering/job-sometimes-failing-due-to-library-installation-error-of-pypi/td-p/113704&lt;/A&gt;, that PIP_NO_CACHE_DIR specifically is not passed to the process (other variables do get passed, as established above).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any way to achieve a persistent pip cache for compute-wide libraries?&lt;/P&gt;</description>
      <pubDate>Thu, 07 Aug 2025 15:47:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</guid>
      <dc:creator>spoltier</dc:creator>
      <dc:date>2025-08-07T15:47:24Z</dc:date>
    </item>
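    <!-- Editor's sketch, not from the thread: a cluster-scoped init script is one
         possible way to apply pip settings machine-wide, since pip also reads
         /etc/pip.conf in addition to PIP_* environment variables. The host and
         paths below are placeholders; note that /local_disk0 is ephemeral, so this
         does not by itself give the persistent cache the post asks about, and
         whether the managed library installer honors cache settings at all is
         exactly the open question here.

    #!/bin/bash
    # Write a machine-wide pip configuration so every pip invocation on the node,
    # including the cluster-library installer, sees the same index and cache settings.
    cat > /etc/pip.conf <<'EOF'
[global]
index-url = https://artifactory.example.com/api/pypi/pypi-remote/simple
trusted-host = artifactory.example.com
cache-dir = /local_disk0/pip-cache
EOF
    -->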
    <item>
      <title>Re: Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131175#M3991</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/177428"&gt;@spoltier&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;If you want to avoid the issues with PIP_CACHE_DIR and the cache being lost on cluster restarts, my recommendation is to use a &lt;STRONG&gt;custom Docker image&lt;/STRONG&gt; with your libraries pre-installed. This is the easiest way to “install” dependencies consistently without having to re-download them every time the cluster starts.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That said, be aware that &lt;STRONG&gt;not all Databricks features are available when using a custom Docker image&lt;/STRONG&gt;. For example, you currently cannot use &lt;STRONG&gt;Graviton instances&lt;/STRONG&gt; or access &lt;STRONG&gt;tables with data masking policies&lt;/STRONG&gt; from custom images.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You also haven’t specified the type of cluster you’re using, but I assume it’s an &lt;STRONG&gt;All-Purpose&lt;/STRONG&gt; cluster.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;As an alternative (or complement), you can also use a service like &lt;STRONG&gt;Nexus or Artifactory as a package proxy&lt;/STRONG&gt;. This improves performance and reliability because dependencies are served from your cached repository instead of being downloaded repeatedly from PyPI or other external sources.&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; Isi&lt;/P&gt;</description>
      <pubDate>Sun, 07 Sep 2025 19:36:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131175#M3991</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-09-07T19:36:32Z</dc:date>
    </item>
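    <!-- Editor's sketch, not from the thread: pre-baking the dependencies into a
         custom image as the reply suggests, assuming Databricks Container Services.
         The base image tag, the pip path inside the image, the registry, and the
         package list are all placeholders to check against the container docs.

    # Emit the Dockerfile from a heredoc so the whole sketch stays in one shell block.
    cat > Dockerfile <<'EOF'
FROM databricksruntime/standard:latest
RUN /databricks/python3/bin/pip install numpy pandas
EOF

    # Build and push to the registry the cluster is configured to pull from.
    docker build -t registry.example.com/runtime-prebaked:1.0 .
    docker push registry.example.com/runtime-prebaked:1.0
    -->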
    <item>
      <title>Re: Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131191#M3994</link>
      <description>&lt;P&gt;Hi Isi,&lt;/P&gt;&lt;P&gt;We moved away from Docker images for the reasons you mention, and because they caused other issues for us. We are already using Artifactory (as hinted by the environment variables mentioned in my post); I wanted to try further improving the startup times. The approach suggested in other posts, putting wheel files on a volume and installing them manually, seems hard to maintain and potentially unreliable.&lt;BR /&gt;&lt;BR /&gt;I was using both all-purpose and job clusters (more all-purpose, for ease of testing and iteration), but both would be required in the production solution.&lt;BR /&gt;&lt;BR /&gt;I will take your answer as a "&lt;STRONG&gt;No&lt;/STRONG&gt;" to my original question.&lt;BR /&gt;&lt;BR /&gt;In general, there should be clear documentation on which pip variables and configuration options can be used: some are empirically available, others are not, but I couldn't find exhaustive documentation confirming this.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2025 06:56:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131191#M3994</guid>
      <dc:creator>spoltier</dc:creator>
      <dc:date>2025-09-08T06:56:23Z</dc:date>
    </item>
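    <!-- Editor's sketch: the wheels-on-a-volume pattern the reply above rules out
         as hard to maintain, spelled out for readers weighing the trade-off. Paths
         and the requirements file are placeholders; PIP_NO_INDEX and PIP_FIND_LINKS
         are standard pip environment variables, used here in place of the
         equivalent command-line flags. The download step must run from a notebook
         or job with write access to the volume, since (as established above) the
         library-install process itself cannot write there.

    # Populate the wheelhouse on a Unity Catalog volume; rerun on dependency changes.
    pip download -d /Volumes/catalog/schema/wheelhouse -r requirements.txt

    # Install offline from the wheelhouse, never touching the network.
    PIP_NO_INDEX=1 PIP_FIND_LINKS=/Volumes/catalog/schema/wheelhouse \
        pip install -r requirements.txt
    -->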
  </channel>
</rss>

