Using a proxy server to install packages from PyPI in Azure Databricks
Sunday
Hi,
I'm setting up a workspace in Azure and would like to put some restrictions in place on outbound Internet access to reduce the risk of data exfiltration from notebooks and jobs. I plan to use VNet Injection and SCC + back-end private link for compute to control plane traffic. I understand that means the compute subnets can be set up without direct outbound Internet access.
I've seen guides like https://www.databricks.com/blog/data-exfiltration-protection-with-azure-databricks where a network virtual appliance like Azure Firewall is used to allow traffic to certain domains (.pypi.org, .pythonhosted.org, etc.).
As an alternative to Azure Firewall, I'd like to use an explicit HTTP proxy to better align with other infrastructure. I know in general pip can work behind a proxy if the http_proxy / https_proxy environment variables are set.
- Is there a way to configure a compute cluster to use an HTTP proxy for installing libraries? In particular, I'm interested in making it easy for a user to install notebook-scoped Python libraries from PyPI using a normal %pip command. Is there something I could do in a cluster-scoped init script to set the environment variables http_proxy and https_proxy so they're available to notebooks? Would I need to add anything to no_proxy, to allow normal connections to the control plane via the back-end private link?
- Are there other outbound connections needed for normal job / notebook execution, other than package repositories like PyPI?
- If a user clones a repo from GitHub in the workspace UI, is that traffic coming from the control plane or compute plane?
Thanks!
Monday
Hey @mzs ,
If I understood correctly, you want to configure a Databricks compute cluster to use an HTTP proxy for installing libraries via %pip install, instead of using Azure Firewall.
Yes, this should be possible by setting the http_proxy and https_proxy environment variables in an init script. Tools on the compute plane that honor these variables (such as pip when installing packages from PyPI) should then send their requests through the proxy.
You can try adding the following init script to your cluster:
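#!/bin/bash
# Append proxy settings to /etc/environment so that processes started later can pick them up.
# Replace <proxy-address> and <port> with your proxy's host and port.
echo "export http_proxy=http://<proxy-address>:<port>" >> /etc/environment
echo "export https_proxy=http://<proxy-address>:<port>" >> /etc/environment
# Bypass entries use a leading dot (domain suffix); pip does not expand "*." wildcards. Both spellings are set because tools differ in which one they read.
echo "export NO_PROXY=169.254.169.254,.azuredatabricks.net,.blob.core.windows.net,.dfs.core.windows.net,.table.core.windows.net,.queue.core.windows.net,.service.signalr.net" >> /etc/environment
echo "export no_proxy=169.254.169.254,.azuredatabricks.net,.blob.core.windows.net,.dfs.core.windows.net,.table.core.windows.net,.queue.core.windows.net,.service.signalr.net" >> /etc/environment
source /etc/environment  # only affects this init script's own shell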
- %pip install should then pick up the proxy automatically.
- Internal traffic to Azure services and the Databricks control plane should still connect directly, because those hosts are listed in NO_PROXY.
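Since the NO_PROXY entries decide which hosts skip the proxy, it may be worth checking how a given list is interpreted before relying on it. Here is a minimal sketch using the matcher in Python's standard library (urllib.request.proxy_bypass_environment). The workspace hostname is a made-up placeholder, and pip ships its own HTTP stack, so treat the result as an approximation of pip's behavior rather than a guarantee.

```python
from urllib.request import proxy_bypass_environment


def bypasses_proxy(host, no_proxy):
    """Return True if `host` would skip the proxy for this no_proxy list."""
    # The stdlib matcher reads the bypass list from the "no" key.
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))


# Two ways of writing the same intent.
WILDCARD_STYLE = "169.254.169.254,*.azuredatabricks.net"
SUFFIX_STYLE = "169.254.169.254,.azuredatabricks.net"

# Hypothetical workspace hostname, for illustration only.
WORKSPACE_HOST = "adb-1234567890123456.7.azuredatabricks.net"

if __name__ == "__main__":
    for label, value in (("wildcard", WILDCARD_STYLE), ("suffix", SUFFIX_STYLE)):
        for host in (WORKSPACE_HOST, "pypi.org", "169.254.169.254"):
            print(f"{label:8} {host:45} bypass={bypasses_proxy(host, value)}")
```

If the wildcard form does not match your workspace host here, the leading-dot form is the safer choice for Python-based tools.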
I’ve never tested this exact setup before, so if you try it out, I’d really appreciate it if you could share your results.
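One quick way to test it would be to run something like the following in a notebook cell after the cluster starts, to see whether the variables from the init script are actually visible to the notebook's Python process. This is only a sketch; the helper name visible_proxy_settings is my own, and an empty result would suggest the variables need to be set some other way.

```python
import os

# Variable names that pip and most HTTP clients look for.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def visible_proxy_settings(environ=os.environ):
    """Collect proxy-related variables (lower- and upper-case) from `environ`."""
    found = {}
    for name in PROXY_VARS:
        for variant in (name, name.upper()):
            if variant in environ:
                found[variant] = environ[variant]
    return found


settings = visible_proxy_settings()
if settings:
    for key, value in sorted(settings.items()):
        print(f"{key}={value}")
else:
    print("No proxy variables are visible to this process.")
```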
Hope this helps 🙂
Isi

