Hello all,
I have a slightly niche issue here, albeit one that others are likely to run into.
Using Databricks on Azure, my organisation has extended our WAN into the cloud, so that all compute clusters are granted a private IP address that can reach on-prem servers (using VNet injection). One of those servers is an HTTP/HTTPS proxy, through which all our traffic to non-Azure systems should be routed. This is achieved through SCC and private VNets.
Recently, to permit the installation of libraries from PyPI, I added the following environment variables to the clusters: http_proxy, HTTP_PROXY, https_proxy, HTTPS_PROXY. That allowed installation from PyPI on boot when adding the library to the cluster libraries (I am aware that pip --proxy can work in a notebook).
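For reference, these are set as cluster environment variables under the cluster's advanced options; the host and port below are placeholders for our on-prem proxy:
http_proxy=http://<proxy-host>:<proxy-port>
HTTP_PROXY=http://<proxy-host>:<proxy-port>
https_proxy=http://<proxy-host>:<proxy-port>
HTTPS_PROXY=http://<proxy-host>:<proxy-port>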
However, I've since discovered that this breaks Python's access to volumes mounted in Unity Catalog. My volume is an Azure storage account container. The proxy server's IP is whitelisted in the Azure storage account (it's a public IPv4 address). I am account admin and workspace admin, and have manually granted myself all privileges at the catalog level. Compute is single-user compute running DBR 15.4 LTS with config
spark.databricks.cluster.profile singleNode
spark.master local[*]
The below code works as expected regardless of the environment variables:
path = "/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json"
dbutils.fs.head(path)
The following code functions as expected (reading the file) without the variables set,
import json

with open(path, "r") as fp:
    json.load(fp)
but throws the following error when the proxy variables are set
PermissionError: [Errno 13] Permission denied: '/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json'
I would really like to keep the proxy environment variables so that all traffic to the public internet (e.g. library installs) is caught seamlessly. What I'm hoping can be answered:
- Why does volume access via dbutils or browsing the Unity Catalog GUI work on the compute when the environment variables are set, but vanilla Python does not?
- Are there any suggested workarounds to allow Python to interact with the filesystem when the environment variables are set? (See the sketch after this list for the kind of thing I mean.)
- If I cannot set the HTTP/HTTPS proxy environment variables, are there any other variables or Spark config that could get the cluster to access PyPI, Maven, etc. via a proxy by default?
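To make the second bullet concrete, the kind of workaround I have in mind would be something like the sketch below, which temporarily drops the proxy variables around the file access and restores them afterwards. I haven't verified that this actually avoids the PermissionError, and it rather defeats the point of setting the variables globally, but it illustrates the question:

import json
import os
from contextlib import contextmanager

@contextmanager
def without_proxy():
    # Temporarily remove the proxy variables for this process only,
    # restoring them once the block exits.
    saved = {key: os.environ.pop(key, None)
             for key in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY")}
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is not None:
                os.environ[key] = value

with without_proxy():
    with open(path, "r") as fp:
        data = json.load(fp)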