virtual environment on azure databricks compute cluster

jole3112
New Contributor III

I'm using Azure Databricks and would like to create a per-project virtual environment that persists on a shared compute cluster. Because the cluster is shared across many projects, I need separate virtual environments to run code from within Databricks Repos. The environment should be easy to create from a requirements.txt or conda.yaml file and to activate at the start of a notebook with a magic command such as %conda activate <env_name>. I cannot find any documentation that lists the steps, but it seems doable according to @Suteja Kanuri's answer (option 2) in another question thread here.

Thank you.

7 REPLIES

Debayan
Databricks Employee

Hi, you can refer to https://www.databricks.com/blog/2020/06/17/simplify-python-environment-management-on-databricks-runt.... Also, please review the limitations described there.

Please tag @Debayan in your next response so that I am notified. Thank you!

jole3112
New Contributor III

Hi @Debayan Mukherjee​ ,

Thank you for your response. I have read the article, and here is my understanding; please let me know if it is correct:

  1. By default, all Python code runs against the set of dependencies pre-installed in the runtime. These can of course differ from the versions that developers use outside of Databricks.
  2. The magic commands %conda and %pip install dependencies only for the notebook in which they are run.
  3. Given #2, these commands do not ensure that all modules in the project run on the same set of dependencies. I cannot use the method shown in Figure 5, because it applies only to notebooks and not to regular Python scripts, which also need their own libraries.
  4. "Currently, %conda activate and %conda env create are not supported." So is there no way to create and use a single virtual environment throughout the project?

Can you help with some additional questions:

  1. When will the items in #4 above be made available (the article is dated 3 years ago)? That would keep the runtime and dev environments consistent, avoid compatibility issues, and avoid manually installing additional or different packages one by one.
  2. Regarding the statement "If you need some libraries that are always available on the cluster, you can install them in an init script or using a docker container.": do these solutions install libraries into the cluster with no way to separate them into different environments for different projects using the same cluster? If so, will everyone using the cluster have to use whichever set of libraries was installed last?
  3. Are there any tools on Azure Databricks for creating and managing virtual environments, without installing packages into the system environment?
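The version-drift concern behind the first question can at least be measured from inside whatever interpreter runs the code. The following is a minimal sketch, not a Databricks feature: it assumes a requirements file made of simple `name==version` pins, and the helper names `parse_pins` and `find_mismatches` are hypothetical. It uses only the standard-library `importlib.metadata` module and compares version strings literally, so `1.0` and `1.0.0` count as different.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for simple 'name==version' lines only."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines, options, and unpinned requirements
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pins with what is installed in the current interpreter.

    Returns {package: (pinned, installed)}; installed is None if missing.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches(parse_pins(text))` on the contents of a project's requirements.txt lists every package whose installed version differs from the pin, which shows how far the runtime's pre-installed set is from the dev environment.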

Thank you for your kind assistance.

Debayan
Databricks Employee

Hi,

  1. The article is dated 3 years ago. I will check internally and confirm.
  2. "these solutions will install libraries into the cluster with no ability to separate into different environments for different projects using the same cluster?" I am not sure I understood your question correctly, but the cluster can be shared and can also be cloned after the libraries are installed.
  3. I will check on this and get back to you.

Thanks.

jole3112
New Contributor III

Thank you. Please let me know when you have the information.

Anonymous
Not applicable

Hi @Joshua L​ 

We haven't heard from you since the last response from @Debayan Mukherjee, and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

jole3112
New Contributor III

Hi @Vidula Khanna​ 

I'm still waiting for @Debayan Mukherjee's responses to the other two questions, as stated in his reply.

Thank you

Debayan
Databricks Employee

Hi @Joshua L, I appreciate your patience.

I have checked with the SME and got confirmation that this (%conda activate and %conda env create) is not supported and has in fact been deprecated. I don't think we plan to support it in the future.

Also, we are not aware of any Azure Databricks tools for this, but the following guides can be followed:

https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/

https://docs.python.org/3/tutorial/venv.html

This is, however, effectively the same as installing packages into the system.

We don't think this is possible without installing packages into the system. If the packages are not installed on the system, can you clarify what you expect instead? Even a dedicated tool would install a few packages on the system. Please let me know if I have misunderstood the requirement.
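For reference, the venv approach from the two guides linked above can be sketched in plain Python with the standard-library `venv` module. This is a minimal sketch under assumptions, not a Databricks-supported workflow: the helper names (`env_python`, `create_project_env`, `run_in_env`) and any directory you pass in are hypothetical, and nothing in this thread confirms that such a directory persists on a shared cluster. It also does not switch the environment of the running notebook; scripts run in a subprocess that uses the environment's own interpreter.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir):
    """Path to the interpreter inside a venv (layout differs on Windows)."""
    env_dir = Path(env_dir)
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_project_env(env_dir, requirements=None, with_pip=True):
    """Create a venv at env_dir and optionally install a requirements file."""
    venv.EnvBuilder(with_pip=with_pip, clear=False).create(str(env_dir))
    python = env_python(env_dir)
    if requirements is not None:
        # Install with the venv's own pip so nothing lands in the system env.
        subprocess.run(
            [str(python), "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
    return python


def run_in_env(python, script, *args):
    """Run a script with the venv's interpreter instead of the system one."""
    return subprocess.run([str(python), str(script), *args], check=True)
```

A project would call `create_project_env` once with its own directory and requirements.txt, then launch its scripts through `run_in_env`. Packages installed this way live inside the environment directory, although the environment itself is still created from the interpreter already installed on the machine.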
