How to use a spark-submit Python task with the --archives parameter to pass a .tar.gz conda env?
01-30-2023 08:52 AM
We've been trying to launch a spark-submit Python task using the "archives" parameter, similar to the one used in YARN.
However, we haven't been able to make it work in Databricks.
For our on-prem installation we can follow a tutorial such as this one: https://conda.github.io/conda-pack/spark.html, which relies on YARN resource localization to uncompress the tar.gz inside the executors.
That approach is quite interesting because all we need to do is pack our project into a wheel file and then submit it with spark-submit.
Another reason is that if we manage to solve this, all of our models could be migrated to Databricks much more easily, with little effort.
Has anyone needed this already and managed to solve it? How does the community handle spark-submit on Databricks, focusing on entire projects rather than notebook-based projects?
Labels: Conda, Python Task, Spark, Spark-submit

04-10-2023 07:04 AM
@Ryoji Kuwae Neto:
To use the --archives parameter with a conda environment in Databricks, you can follow these steps:
1) Create a conda environment for your project and pack it into a .tar.gz file with conda-pack (which needs to be installed in the environment):
conda create --name myenv
conda activate myenv
conda install <your required packages>
conda install -c conda-forge conda-pack
conda pack --output myenv.tar.gz
2) Upload the myenv.tar.gz file to a Databricks workspace directory or to a cloud storage location such as AWS S3 or Azure Blob Storage.
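For example, from a Databricks notebook (where dbutils is available) the packed archive could be copied from the driver's local disk to DBFS; both paths below are placeholders for illustration, not fixed locations:
# Copy the packed conda environment from local disk to DBFS.
# Source and destination paths are assumptions; adjust them to your setup.
dbutils.fs.cp("file:/tmp/myenv.tar.gz", "dbfs:/FileStore/envs/myenv.tar.gz")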
3) In your spark-submit command, pass the --archives parameter with the path to the myenv.tar.gz file:
spark-submit --archives myenv.tar.gz#myenv my_script.py
Here, my_script.py is the main Python script of your Spark application. The #myenv suffix after the archive path tells Spark to unpack the archive into a directory named myenv in each executor's working directory.
4) Tell Spark to use the packed environment's Python interpreter by adding the command-line option --conf spark.pyspark.python=./myenv/bin/python to spark-submit. The path is relative to the executors' working directory, which is where Spark unpacks the archive. Inside your main Python script you don't need to activate the environment, but it can help to prepend the environment's bin directory to PATH so that bundled tools and libraries resolve correctly:
import os
import sys
# Spark unpacks the archive into ./myenv in the working directory.
env_dir = "./myenv"
# Make the environment's executables visible to the script and its subprocesses.
os.environ['PATH'] = os.path.join(env_dir, 'bin') + os.pathsep + os.environ['PATH']
# With spark.pyspark.python set, this should point at ./myenv/bin/python.
print(sys.executable)
This makes Spark run your code with the conda environment's Python binary, so the packages packed into myenv.tar.gz can be loaded on the executors.
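Putting steps 3 and 4 together, the full command might look like the sketch below. The dbfs:/ path mirrors the upload location assumed in step 2; replace it with wherever the archive actually lives, using a URI scheme your cluster can read (or a local path on the driver):
spark-submit \
  --archives dbfs:/FileStore/envs/myenv.tar.gz#myenv \
  --conf spark.pyspark.python=./myenv/bin/python \
  my_script.py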
Note that if your conda environment contains non-Python dependencies, such as compiled libraries, you may need to include additional configuration parameters in your Spark application to ensure that they are properly loaded at runtime.
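For instance, one pattern (an illustration rather than something every environment needs) is to expose the unpacked environment's native libraries to the executors via Spark's per-executor environment variables:
--conf spark.executorEnv.LD_LIBRARY_PATH=./myenv/lib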
Also, if you want to package your entire project as a wheel file, you can include it in the archive along with the conda environment. Then, in your main Python script, you can import the required modules from the wheel file using the sys.path and importlib modules.
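A minimal sketch of that idea, assuming a pure-Python wheel named myproject-0.1.0-py3-none-any.whl was packed inside the archive and contains a top-level package called myproject (both names are hypothetical; wheels with compiled extensions generally need to be installed instead):
import importlib
import sys
# Wheels are zip archives, so a pure-Python wheel can be put on sys.path directly.
# The filename below is a placeholder; use your project's actual wheel.
sys.path.insert(0, "./myenv/myproject-0.1.0-py3-none-any.whl")
# Import the project's top-level package (hypothetical name) at runtime.
myproject = importlib.import_module("myproject")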
04-10-2023 04:58 PM
In the specified case, what is the best place to store the tar file, considering a spark-submit on a job cluster?

