Using a proxy server to install packages from PyPI in Azure Databricks
yesterday
Hi,
I'm setting up a workspace in Azure and would like to restrict outbound Internet access to reduce the risk of data exfiltration from notebooks and jobs. I plan to use VNet injection with secure cluster connectivity (SCC) and back-end Private Link for traffic from compute to the control plane. As I understand it, that means the compute subnets can be set up without direct outbound Internet access.
I've seen guides like https://www.databricks.com/blog/data-exfiltration-protection-with-azure-databricks where a network virtual appliance like Azure Firewall is used to allow traffic to certain domains (.pypi.org, .pythonhosted.org, etc.).
As an alternative to Azure Firewall, I'd like to use an explicit HTTP proxy to better align with other infrastructure. I know in general pip can work behind a proxy if the http_proxy / https_proxy environment variables are set.
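To illustrate the general pip behavior I mean (outside Databricks), here is a minimal sketch. The proxy URL and the `no_proxy` entries are placeholders, and `proxy_env` / `pip_install` are hypothetical helper names; whether the same variables reach `%pip` in a notebook is exactly what I'm unsure about.

```python
import os
import subprocess
import sys


def proxy_env(proxy_url, no_proxy=()):
    """Return a copy of the current environment with proxy variables set."""
    env = dict(os.environ)
    # Set both lowercase and uppercase names, since tools differ in which they read.
    for name in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY"):
        env[name] = proxy_url
    if no_proxy:
        joined = ",".join(no_proxy)
        env["no_proxy"] = joined
        env["NO_PROXY"] = joined
    return env


def pip_install(package, proxy_url, no_proxy=()):
    """Run pip in a subprocess that sees the proxy variables."""
    cmd = [sys.executable, "-m", "pip", "install", package]
    return subprocess.run(cmd, env=proxy_env(proxy_url, no_proxy), check=True)


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute the real proxy and bypass hosts.
    pip_install(
        "requests",
        "http://proxy.example.internal:3128",
        no_proxy=["localhost", "127.0.0.1"],
    )
```

On a plain VM this is enough for pip to go through the proxy; the open question is the equivalent for a cluster.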
- Is there a way to configure a compute cluster to use an HTTP proxy for installing libraries? In particular, I'm interested in making it easy for a user to install notebook-scoped Python libraries from PyPI using a normal %pip command. Is there something I could do in a cluster-scoped init script to set the environment variables http_proxy and https_proxy so they're available to notebooks? Would I need to add anything to no_proxy, to allow normal connections to the control plane via the back-end private link?
- Are there other outbound connections needed for normal job / notebook execution, other than package repositories like PyPI?
- If a user clones a repo from GitHub in the workspace UI, is that traffic coming from the control plane or compute plane?
Thanks!

