How to use a spark-submit Python task with the --archives parameter to pass a .tar.gz conda env?

ryojikn
New Contributor III

We've been trying to launch a spark-submit Python task using the "archives" parameter, similar to the one used in YARN.

However, we haven't been able to make it work in Databricks.

We know that for our on-prem installation we can follow a tutorial such as this one: https://conda.github.io/conda-pack/spark.html, which relies on YARN resource localization to uncompress our tar.gz inside the executors.

That approach is quite interesting because all we need to do is pack our project into a wheel file and then submit it with spark-submit.

Another reason is that if we manage to solve this, all of our models could be migrated to Databricks much more easily, without much effort.

Has anyone needed this already and managed to solve it? How does the community handle spark-submit on Databricks, focusing on entire projects rather than notebook-based projects?

2 REPLIES

Anonymous
Not applicable

@Ryoji Kuwae Neto:

To use the --archives parameter with a conda environment in Databricks, you can follow these steps:

1) Create a conda environment for your project and pack it as a .tar.gz file:

conda create --name myenv
conda activate myenv
conda install <your required packages>
# conda-pack is needed to build the relocatable archive
conda install -c conda-forge conda-pack
conda pack --output myenv.tar.gz

2) Upload the myenv.tar.gz file to a Databricks workspace directory or to a cloud storage location such as AWS S3 or Azure Blob Storage.
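
For example, with the (legacy) Databricks CLI configured, the archive can be copied to DBFS; the target path below is just an illustration:

databricks fs cp myenv.tar.gz dbfs:/FileStore/envs/myenv.tar.gz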

3) In your spark-submit command, specify the --archives parameter with the path to the myenv.tar.gz file:

spark-submit --archives myenv.tar.gz#myenv my_script.py

Here, my_script.py is the main Python script of your Spark application. The # separator assigns an alias to the archive: the directory name under which Spark unpacks it on each node. In this case, the archive is unpacked into a directory named myenv.
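
Putting steps 3 and 4 together, a typical invocation might look like the sketch below; the dbfs:/ path is only an example and assumes the archive location is readable by the cluster (the spark.pyspark.python setting is explained in the next step):

spark-submit \
  --archives dbfs:/FileStore/envs/myenv.tar.gz#myenv \
  --conf spark.pyspark.python=./myenv/bin/python \
  my_script.py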

4) Point PySpark at the unpacked environment's interpreter. The simplest way is the spark-submit option --conf spark.pyspark.python=./myenv/bin/python; alternatively, you can do it programmatically from a small launcher script:

import os
import subprocess
 
# Directory where Spark unpacked the archive (the alias given after '#')
env_dir = "./myenv"
 
# Put the environment's bin/ directory first on PATH so its tools are found
os.environ['PATH'] = os.path.join(env_dir, 'bin') + os.pathsep + os.environ['PATH']
 
# Run the application with the packed environment's Python interpreter
subprocess.call([os.path.join(env_dir, 'bin', 'python'), 'my_script.py'])

This runs the code with the conda environment's Python binary. The env_dir variable points to the directory where Spark unpacks the archive (the alias given after #), and the environment's bin directory is prepended to PATH so that its executables are found first.

Note that if your conda environment contains non-Python dependencies, such as compiled libraries, you may need to include additional configuration parameters in your Spark application to ensure that they are properly loaded at runtime.
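
For instance, if the packed environment ships native shared libraries, one option is to expose them to the dynamic loader on the executors via Spark's per-executor environment variables; the ./myenv/lib path below is an assumption about how the environment is laid out:

spark-submit \
  --archives dbfs:/FileStore/envs/myenv.tar.gz#myenv \
  --conf spark.pyspark.python=./myenv/bin/python \
  --conf spark.executorEnv.LD_LIBRARY_PATH=./myenv/lib \
  my_script.py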

Also, if you want to package your entire project as a wheel file, you can include it in the archive along with the conda environment. Then, in your main Python script, you can import the required modules from the wheel file using the sys.path and importlib modules.
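
A minimal sketch of that idea, assuming a pure-Python wheel named myproject-1.0-py3-none-any.whl was shipped alongside the environment (both the wheel and the module name are hypothetical):

import sys
import importlib
 
# Wheels are zip archives, so adding the .whl to sys.path makes its
# top-level packages importable without installing them (pure-Python wheels only)
sys.path.insert(0, "./myenv/myproject-1.0-py3-none-any.whl")
 
# Load a module from the wheel by name
pipeline = importlib.import_module("myproject.pipeline")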

ryojikn
New Contributor III

In this scenario, what is the best place to store the tar file, considering a spark-submit task on a job cluster?
