<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Using pip cache for pypi compute libraries in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</link>
    <description>Forum thread: achieving a persistent pip cache for compute-wide PyPI libraries on Databricks clusters.</description>
    <pubDate>Thu, 07 Aug 2025 15:47:24 GMT</pubDate>
    <dc:creator>spoltier</dc:creator>
    <dc:date>2025-08-07T15:47:24Z</dc:date>
    <item>
      <title>Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</link>
      <description>&lt;P&gt;I am able to configure pip's behavior w.r.t. the index URL by setting PIP_INDEX_URL, PIP_TRUSTED_HOST, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I would like to cache compute-wide PyPI libraries to improve cluster startup performance and reliability.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, I notice that PIP_CACHE_DIR has no effect. I also noticed that the library process cannot write to a volume, which is where I would like to store the cache.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I verified this by setting PIP_REPORT=/Volumes/path/to/my/volume, which triggered a permission-denied error; here is an example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Library installation attempted on the driver node of cluster 0519-144033-sht8s7z9 and failed due to an infrastructure issue caused by an invalid access token. Please contact Databricks support. Error code: FAULT_ACCESS_TOKEN_NOT_PROVIDED, error message: org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install '&amp;lt;internal package name&amp;gt;' --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 1] Operation not permitted: '/Volumes/&amp;lt;internal infra details&amp;gt;/pip/report.json'&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've tried passing PIP_NO_CACHE_DIR=0 to make pip use the cache (in case it is disabled by default), but this had no effect.&lt;/P&gt;&lt;P&gt;It does seem, as indicated in &lt;A href="https://community.databricks.com/t5/data-engineering/job-sometimes-failing-due-to-library-installation-error-of-pypi/td-p/113704" target="_blank" rel="noopener"&gt;https://community.databricks.com/t5/data-engineering/job-sometimes-failing-due-to-library-installation-error-of-pypi/td-p/113704&lt;/A&gt;, that PIP_NO_CACHE_DIR specifically is not passed to the process (other variables do get passed, as established above).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any way to achieve a persistent pip cache for compute-wide libraries?&lt;/P&gt;</description>
      <pubDate>Thu, 07 Aug 2025 15:47:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/127689#M3829</guid>
      <dc:creator>spoltier</dc:creator>
      <dc:date>2025-08-07T15:47:24Z</dc:date>
    </item>
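    <!-- Editor's sketch, not from the thread: a cluster-scoped init script is one
         possible way to apply pip settings machine-wide, since pip also reads
         /etc/pip.conf in addition to PIP_* environment variables. The host and
         paths below are placeholders; note that /local_disk0 is ephemeral, so this
         does not by itself give the persistent cache the post asks about, and
         whether the managed library installer honors cache settings at all is
         exactly the open question here.

    #!/bin/bash
    # Write a machine-wide pip configuration so every pip invocation on the node,
    # including the cluster-library installer, sees the same index and cache settings.
    cat > /etc/pip.conf <<'EOF'
[global]
index-url = https://artifactory.example.com/api/pypi/pypi-remote/simple
trusted-host = artifactory.example.com
cache-dir = /local_disk0/pip-cache
EOF
    -->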
    <item>
      <title>Re: Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131175#M3991</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/177428"&gt;@spoltier&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;If you want to avoid the issues with PIP_CACHE_DIR and the cache being lost on cluster restarts, my recommendation is to use a &lt;STRONG&gt;custom Docker image&lt;/STRONG&gt; with your libraries pre-installed. This is the easiest way to “install” dependencies consistently without having to re-download them every time the cluster starts.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That said, be aware that &lt;STRONG&gt;not all Databricks features are available when using a custom Docker image&lt;/STRONG&gt;. For example, you currently cannot use &lt;STRONG&gt;Graviton instances&lt;/STRONG&gt; or access &lt;STRONG&gt;tables with data masking policies&lt;/STRONG&gt; from custom images.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You also haven’t specified the type of cluster you’re using, but I assume it’s an &lt;STRONG&gt;All-Purpose&lt;/STRONG&gt; cluster.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;As an alternative (or complement), you can also use a service like &lt;STRONG&gt;Nexus or Artifactory as a package proxy&lt;/STRONG&gt;. This improves performance and reliability because dependencies are served from your cached repository instead of being downloaded repeatedly from PyPI or other external sources.&lt;BR /&gt;&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; Isi&lt;/P&gt;</description>
      <pubDate>Sun, 07 Sep 2025 19:36:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131175#M3991</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-09-07T19:36:32Z</dc:date>
    </item>
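    <!-- Editor's sketch, not from the thread: pre-baking the dependencies into a
         custom image as the reply suggests, assuming Databricks Container Services.
         The base image tag, the pip path inside the image, the registry, and the
         package list are all placeholders to check against the container docs.

    # Emit the Dockerfile from a heredoc so the whole sketch stays in one shell block.
    cat > Dockerfile <<'EOF'
FROM databricksruntime/standard:latest
RUN /databricks/python3/bin/pip install numpy pandas
EOF

    # Build and push to the registry the cluster is configured to pull from.
    docker build -t registry.example.com/runtime-prebaked:1.0 .
    docker push registry.example.com/runtime-prebaked:1.0
    -->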
    <item>
      <title>Re: Using pip cache for pypi compute libraries</title>
      <link>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131191#M3994</link>
      <description>&lt;P&gt;Hi Isi,&lt;/P&gt;&lt;P&gt;We moved away from Docker images for the reasons you mention, and because they caused other issues for us. We are already using Artifactory (as hinted by the environment variables mentioned in my post); I wanted to try further improving the startup times. The approach suggested in other posts, putting wheel files on a volume and installing them manually, seems hard to maintain and potentially unreliable.&lt;BR /&gt;&lt;BR /&gt;I was using both all-purpose and job clusters (more all-purpose, for ease of testing and iteration), but both would be required in the production solution.&lt;BR /&gt;&lt;BR /&gt;I will take your answer as a "&lt;STRONG&gt;No&lt;/STRONG&gt;" to my original question.&lt;BR /&gt;&lt;BR /&gt;In general, there should be clear documentation on which pip variables and configuration options can be used: some are empirically available, others are not, but I couldn't find exhaustive documentation confirming this.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Sep 2025 06:56:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/using-pip-cache-for-pypi-compute-libraries/m-p/131191#M3994</guid>
      <dc:creator>spoltier</dc:creator>
      <dc:date>2025-09-08T06:56:23Z</dc:date>
    </item>
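    <!-- Editor's sketch: the wheels-on-a-volume pattern the reply above rules out
         as hard to maintain, spelled out for readers weighing the trade-off. Paths
         and the requirements file are placeholders; PIP_NO_INDEX and PIP_FIND_LINKS
         are standard pip environment variables, used here in place of the
         equivalent command-line flags. The download step must run from a notebook
         or job with write access to the volume, since (as established above) the
         library-install process itself cannot write there.

    # Populate the wheelhouse on a Unity Catalog volume; rerun on dependency changes.
    pip download -d /Volumes/catalog/schema/wheelhouse -r requirements.txt

    # Install offline from the wheelhouse, never touching the network.
    PIP_NO_INDEX=1 PIP_FIND_LINKS=/Volumes/catalog/schema/wheelhouse \
        pip install -r requirements.txt
    -->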
  </channel>
</rss>

