12-13-2022 03:11 AM
I'm trying to set the PYTHONPATH environment variable in the cluster configuration: `PYTHONPATH=/dbfs/user/blah`. But in the driver and executor environments it is apparently getting overridden, and I don't see it.
`%sh echo $PYTHONPATH` outputs:
`PYTHONPATH=/databricks/spark/python:/databricks/spark/python/lib/py4j-0.10.9.5-src.zip:/databricks/jars/spark--driver--driver-spark_3.3_2.12_deploy.jar:/WSFS_NOTEBOOK_DIR:/databricks/spark/python:/databricks/python_shell`
and `import sys; print(sys.path)`:
```
'/databricks/python_shell/scripts', '/local_disk0/spark-c87ff3f0-1b67-4ec4-9054-079bba1860a1/userFiles-ea2f1344-51c6-4363-9112-a0dcdff663d0', '/databricks/spark/python', '/databricks/spark/python/lib/py4j-0.10.9.5-src.zip', '/databricks/jars/spark--driver--driver-spark_3.3_2.12_deploy.jar', '/databricks/python_shell', '/usr/lib/python39.zip', '/usr/lib/python3.9', '/usr/lib/python3.9/lib-dynload', '', '/local_disk0/.ephemeral_nfs/envs/pythonEnv-267a0576-e6bd-4505-b257-37a4560e4756/lib/python3.9/site-packages', '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages', '/databricks/python/lib/python3.9/site-packages', '/usr/local/lib/python3.9/dist-packages', '/usr/lib/python3/dist-packages', '/databricks/python/lib/python3.9/site-packages/IPython/extensions', '/root/.ipython'
```
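For comparison, here is a quick sketch to run the same check on an executor (it assumes the usual `spark` session object available in a Databricks notebook):

```python
# Sketch: run the same inspection on an executor to confirm the /dbfs path
# is missing there as well (expected: False in this scenario).
def inspect_worker(_):
    import os, sys
    yield os.environ.get("PYTHONPATH", ""), list(sys.path)

worker_pythonpath, worker_sys_path = (
    spark.sparkContext.parallelize([0], numSlices=1)
    .mapPartitions(inspect_worker)
    .collect()[0]
)

print("/dbfs/user/blah" in worker_pythonpath)  # False: the value was overridden
print("/dbfs/user/blah" in worker_sys_path)    # False here as well
```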
If I work from Repos, it does add the repo path `/Workspace/Repos/user@domain.com/my_repo` everywhere, but then I need all my modules to live directly there, which is not convenient.
Please let me know if there's a workaround to set a `/dbfs/` path on all nodes without the ugly trick of a ***** UDF, but straight from a cluster init script, or best of all via a dynamic `spark.conf` property.
12-13-2022 03:17 AM
Hi @Ohad Raviv, can you try init scripts? They might help you: https://docs.databricks.com/clusters/init-scripts.html
12-26-2022 06:06 AM
An init script won't work if you mean exporting the PYTHONPATH env setting: the Databricks shell overwrites it when it starts the Python interpreter. One approach that works for us, when the code is under /dbfs, is to do an editable install in the init script, e.g.
`pip install -e /dbfs/some_repos_code`
This creates an easy-install.pth under the /databricks/python3 site-packages at cluster initialization, which appends the path to sys.path on both the driver and the workers.
This approach avoids appending to sys.path everywhere in the code, which breaks code integrity, and it is easier to enforce at the cluster level.
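For the editable install to succeed, the /dbfs folder needs a minimal package definition next to the modules. A rough sketch, where the package name `my_dbfs_modules` is just a placeholder:

```python
# /dbfs/some_repos_code/setup.py -- minimal sketch so that
# `pip install -e /dbfs/some_repos_code` in the init script has something to install.
# The name `my_dbfs_modules` is a placeholder, not from this thread.
from setuptools import setup, find_packages

setup(
    name="my_dbfs_modules",
    version="0.0.1",
    packages=find_packages(),  # any folder with an __init__.py under /dbfs/some_repos_code
)
```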
We also tried the same editable install for Repos under /Workspace, but it failed. Apparently the /Workspace partition is not mounted during cluster initialization. We are going to ask Databricks to look into this.
12-13-2022 08:38 AM
Do you have any suggestions as to what I should run in the init script?
Setting an env variable there has no effect, as it cannot change the main process's environment.
How would I add a library to the Python path?
And even if I could, it would be a hard-coded library, and I would then need a dedicated cluster configuration for every developer/library.
12-13-2022 11:50 PM
Update:
At last I found a (hacky) solution!
In the driver I can dynamically extend sys.path on the workers with:
`spark._sc._python_includes.append("/dbfs/user/blah/")`
Combine that with, in the driver:
```
%load_ext autoreload
%autoreload 2
```
and setting: `spark.conf.set("spark.python.worker.reuse", "false")`
and we have a fully interactive Spark session with the ability to change Python module code without restarting the Spark session/cluster.
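Put together on the driver, the workaround looks roughly like this (a sketch: `_python_includes` is a private PySpark attribute that may change between runtime versions, and the driver-side `sys.path.append` is an extra step I add so the modules are importable in the notebook itself):

```python
# Driver-side sketch of the workaround above.
# Caveat: spark._sc._python_includes is an internal PySpark attribute,
# so this relies on implementation details that may change between versions.
import sys

module_dir = "/dbfs/user/blah/"

# Make the modules importable in the driver/notebook as well (extra step, not required by the trick).
sys.path.append(module_dir)

# Ship the directory path to the executors so their Python workers add it to sys.path.
spark._sc._python_includes.append(module_dir)

# Don't reuse Python workers, so edited module code is re-imported on the next task.
spark.conf.set("spark.python.worker.reuse", "false")
```

With `%autoreload 2` active in the notebook, edits to modules under `/dbfs/user/blah/` are picked up on the driver automatically, and the non-reused workers re-import them on each task.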
12-14-2022 04:19 AM
That's great, thanks for sharing the solution.