I am able to configure pip's behavior with respect to the index URL by setting PIP_INDEX_URL, PIP_TRUSTED_HOST, etc.
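For reference, this is roughly how I set them, as environment variables in the compute configuration (the values below are placeholders, not my real mirror):

    PIP_INDEX_URL=https://my-internal-mirror.example.com/simple
    PIP_TRUSTED_HOST=my-internal-mirror.example.com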
I would like to cache compute-wide PyPI libraries to improve cluster startup performance and reliability.
However, I noticed that PIP_CACHE_DIR has no effect. I also noticed that the library-installation process cannot write to a volume, which is where I would like to store the cache.
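Concretely, I pointed the cache at a Unity Catalog volume along these lines (placeholder path):

    PIP_CACHE_DIR=/Volumes/<catalog>/<schema>/<volume>/pip-cache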
I verified the write-permission issue by setting PIP_REPORT=/Volumes/path/to/my/volume, which triggered a permission-denied error. Here is the resulting error:
Library installation attempted on the driver node of cluster 0519-144033-sht8s7z9 and failed due to an infrastructure issue caused by an invalid access token. Please contact Databricks support. Error code: FAULT_ACCESS_TOKEN_NOT_PROVIDED, error message: org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install '<internal package name>' --disable-pip-version-check) exited with code 1.
ERROR: Could not install packages due to an OSError: [Errno 1] Operation not permitted: '/Volumes/<internal infra details>/pip/report.json'
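Since the error shows the install running as the 'libraries' user via /bin/su, something like the following (run as root from a %sh cell or the web terminal) should reproduce the permission issue; the --report flag is, as far as I understand, the CLI counterpart of PIP_REPORT and needs a reasonably recent pip, and the volume path and package name are placeholders:

    # roughly reproduces the failure: run pip as the 'libraries' user and ask it to write an install report into the volume
    su libraries -c "/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --dry-run --report /Volumes/<catalog>/<schema>/<volume>/pip/report.json --disable-pip-version-check '<some-package>'"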
I also tried passing PIP_NO_CACHE_DIR=0 to make pip use the cache (in case it is disabled by default), but this had no effect.
It does seem, as indicated in https://community.databricks.com/t5/data-engineering/job-sometimes-failing-due-to-library-installati..., that PIP_NO_CACHE_DIR specifically is not passed to the installation process (other variables do get passed, as established above).
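For what it's worth, this is how I checked which PIP_* variables are at least visible on the driver, from a %sh notebook cell (this shows the notebook's environment, which is not necessarily identical to what the library-installation process receives):

    env | grep '^PIP_'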
Is there any way to achieve a persistent pip cache for compute-wide libraries?