10-02-2024 07:03 AM
Hello all,
I have a slightly niche issue here, albeit one that others are likely to run into.
Using Databricks on Azure, my organisation has extended our WAN into the cloud, so that all compute clusters are granted a private IP address that can access on-prem servers (using VNet injection). One of those servers is an HTTP/HTTPS proxy, through which all our traffic to non-Azure systems should be routed. This is achieved through SCC and private VNets.
Recently, to permit the installation of libraries from PyPI, I added the following to the clusters' environment variables: http_proxy, HTTP_PROXY, https_proxy, HTTPS_PROXY. That allowed installation from PyPI on boot when adding the library to the cluster libraries (I am aware that pip --proxy can work in a notebook).
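For context on why those variables are enough for pip: CPython's standard tooling resolves proxies from the environment via urllib, and pip builds on that. A minimal sketch (the proxy hostname is a placeholder, not our actual server):

```python
import os
import urllib.request

# Simulate the cluster-level environment variables (placeholder proxy address).
os.environ["http_proxy"] = "http://proxy.example.internal:8080"
os.environ["https_proxy"] = "http://proxy.example.internal:8080"

# urllib (and therefore pip and most Python HTTP clients) resolves
# proxies from these environment variables at request time.
proxies = urllib.request.getproxies()
print(proxies)  # e.g. {'http': 'http://proxy.example.internal:8080', ...}
```

The catch, as the rest of the thread shows, is that these variables are honoured by far more than pip: any process in the cluster that respects them will start routing traffic through the proxy.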
However, I've since discovered that this breaks Python's access to volumes mounted in Unity Catalog. My volume is an Azure storage account container, and the proxy server's public IPv4 address is whitelisted in the storage account. I am account admin and workspace admin, and I have manually granted myself all privileges at the catalog level. The compute is single-user compute running DBR 15.4 LTS with the config
spark.databricks.cluster.profile singleNode
spark.master local[*]
The code below works as expected regardless of the environment variables:
path = "/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json"
dbutils.fs.head(path)
The following code functions as expected (reading the file) without the variables set,
import json

with open(path, "r") as fp:
    data = json.load(fp)
but throws the following error when the proxy variables are set:
PermissionError: [Errno 13] Permission denied: '/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json'
I would really like to keep the proxy environment variables so that all traffic to the public internet is caught seamlessly (e.g. library installs). What I'm hoping can be answered:
3 weeks ago
A solution that worked, in addition to having the HTTP_PROXY and HTTPS_PROXY variables set globally, was to add the following definition to the compute policy:
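The policy definition itself isn't shown above. For illustration only: Databricks compute policies can pin cluster environment variables through spark_env_vars attribute paths, so a definition along these lines is plausible (the NO_PROXY value here is a hypothetical placeholder, not the poster's actual setting):

```json
{
  "spark_env_vars.NO_PROXY": {
    "type": "fixed",
    "value": "localhost,127.0.0.1,.dfs.core.windows.net,.blob.core.windows.net"
  }
}
```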
10-08-2025 08:54 AM
Bumping this as I am having the same issue.
Is the solution to just not define the proxy vars globally?
Is there something to add to NO_PROXY or the spark_conf so that Databricks-internal communication with the storage accounts does not go through the proxy?
I have already tried adding the storage accounts in use and the Databricks workspace URL to NO_PROXY, as well as setting the Java proxy options for the driver and executor.
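For anyone checking what Python itself will bypass with a given NO_PROXY value, a quick sketch using urllib's environment-based bypass helper (hostnames are placeholders; note this only covers Python-side HTTP clients, not the JVM or the daemon that backs /Volumes, which may ignore these variables entirely):

```python
import os
import urllib.request

os.environ["http_proxy"] = "http://proxy.example.internal:8080"
# Placeholder suffixes; real entries would name your storage endpoints.
os.environ["no_proxy"] = ".dfs.core.windows.net,.blob.core.windows.net"

# Truthy if urllib would skip the proxy for this host (suffix match on no_proxy).
bypass = urllib.request.proxy_bypass_environment("myaccount.dfs.core.windows.net")
print(bypass)  # truthy -> host is excluded from proxying
```

If this reports a bypass but volume access still fails, the blocked traffic is likely coming from a component that does not read no_proxy at all.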
10-08-2025 09:07 AM
Unfortunately, the only solution I found was to not use the proxy globally. Good luck!