Unity Catalog Volume mounting broken by cluster environment variables (http proxy)

Seb_G
New Contributor

Hello all,
I have a slightly niche issue here, albeit one that others are likely to run into.

Using Databricks on Azure, my organisation has extended our WAN into the cloud, so that all compute clusters are granted a private IP address that can access on-prem servers (using VNet injection). One of those servers is an HTTP/HTTPS proxy, through which all our traffic to non-Azure systems should be routed. This is achieved through secure cluster connectivity (SCC) and private VNets.

Recently, to permit the installation of libraries from PyPI, I added the following environment variables to the clusters: http_proxy, HTTP_PROXY, https_proxy, HTTPS_PROXY. That allowed libraries added to the cluster's library list to be installed from PyPI at start-up (I am aware that pip --proxy can work in a notebook).
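
For reference, this is roughly what I entered in the cluster's Environment variables field (under Advanced options); the proxy host and port here are placeholders for our actual on-prem proxy:

http_proxy=http://proxy.example.internal:8080
https_proxy=http://proxy.example.internal:8080
HTTP_PROXY=http://proxy.example.internal:8080
HTTPS_PROXY=http://proxy.example.internal:8080

The notebook-level alternative I mentioned would be %pip install --proxy http://proxy.example.internal:8080 <package>.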

However, I've since discovered that this breaks Python's access to volumes mounted in Unity Catalog. My volume is backed by an Azure storage account container, and the proxy server's public IPv4 address is whitelisted in the storage account's firewall. I am account admin and workspace admin, and have manually granted myself all privileges at the catalog level. The compute is single-user compute running DBR 15.4 LTS with the config

spark.databricks.cluster.profile singleNode
spark.master local[*]

The below code works as expected regardless of the environment variables:

path = "/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json"
dbutils.fs.head(path)

The following code functions as expected (reading the file) without the variables set,

import json

with open(path, "r") as fp:
    data = json.load(fp)

but throws the following error when the proxy variables are set:

PermissionError: [Errno 13] Permission denied: '/Volumes/<catalog>/<schema>/<volume>/<folder1>/<folder2>/<folder3>/<file>.json'

I would really like to keep the proxy environment variables so that all traffic to the public internet (e.g. library installs) is caught seamlessly. What I'm hoping can be answered:

  • Why does volume access via dbutils, or browsing the Unity Catalog GUI, work on the compute when the environment variables are set, but vanilla Python does not?
  • Are there any suggested workarounds to allow Python to interact with the filesystem while the environment variables are set? (I've sketched one idea I'm considering below.)
  • If I cannot set the http/s proxy environment variables, are there any other variables or Spark config that could get the cluster to access PyPI/Maven etc. via a proxy by default?
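
The idea I'm currently considering for the second question (unverified: I don't know whether the volume FUSE mount honours no_proxy, and the hostnames are placeholders for the real endpoints) is to keep the four proxy variables and add an exclusion list so that storage and local traffic bypasses the proxy:

no_proxy=localhost,127.0.0.1,.blob.core.windows.net,.dfs.core.windows.net
NO_PROXY=localhost,127.0.0.1,.blob.core.windows.net,.dfs.core.windows.net

For the third question, pip at least can be pointed at a proxy without the global variables, e.g. pip config set global.proxy http://proxy.example.internal:8080 run from an init script, though that wouldn't help with Maven.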