How to use Python packages from `sys.path` (in some sort of "edit-mode") so that they work on workers too?

DavideCagnoni
Contributor

The help of `dbx sync` states that

```for the imports to work you need to update the Python path to include this target directory you're syncing to```

This works quite well whenever the package contains only driver-level functions. However, I ran into an issue when my `edit-mode` package contained a scikit-learn model.

In particular, the package contained a few dataframe-processing functions and also a classification model training function.

My notebook extracted some data, processed it into a training dataset, trained a classification model, logged it to MLflow, and then retrieved it (as a pyfunc) to apply it to a simulation dataset. However, as soon as the model needed to be applied, an error was raised:

`Python: ModuleNotFoundError: No module named 'my_package'`

I think this is because the Spark workers do not have the correct `sys.path` set. Is it possible to force them to look in the wanted path?

A mock of my notebook follows:

repo_base = "/Workspace/Repos/me@my.domain/"

import sys

sys.path.append(repo_base)

import mlflow
from pyspark.sql import functions as F  # needed for F.lit below

import my_package as mp

# Build the training and simulation datasets with the package's helpers
train, simulation = mp.split_train_and_simulation_dataset(
    full_dataset=spark.table("mydb.mydataset")
)

classification_model = mp.train_classifier(train)

# Log the trained scikit-learn model to MLflow
with mlflow.start_run() as classifier_training_run:
    mlflow.sklearn.log_model(classification_model, "model")

logged_model_uri = f"runs:/{classifier_training_run.info.run_id}/model"

# Load the model back as a pyfunc Spark UDF and apply it on the workers
loaded_model = mlflow.pyfunc.spark_udf(
    spark, model_uri=logged_model_uri, result_type="string"
)

simulation_with_prediction = simulation.withColumn(
    "predictions", loaded_model("feature_column") == F.lit("True")
)

display(simulation_with_prediction)  # this last command fails

1 ACCEPTED SOLUTION


DavideCagnoni
Contributor

One workaround I found to work is to replace the `sys.path.append` with a `%pip` magic command:

%pip install -e /dbfs/Workspace/Repos/me@my.domain/my_package/

but this has the drawback of requiring a `setup.py` file to work.
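For reference, a minimal `setup.py` for such an editable install might look roughly like this (the project layout and names below are assumptions, not taken from this thread, so adjust them to how the package is actually organised):

# setup.py at the root of the editable-installed project,
# with the importable package in a my_package/ subdirectory
# (hypothetical layout; adapt to your repo)
from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.0.1",
    packages=find_packages(),
)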


8 REPLIES

-werners-
Esteemed Contributor III

If I read it correctly, the part of the help you mention is about syncing to DBFS, whereas you use Repos.

For repos:

When executing notebooks in a repo, the root of the repo is automatically added to the Python path, so imports work relative to the repo root. This means that, aside from turning on autoreload, you don't need to do anything else special for the changes to be reflected in the cell's execution.
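A minimal sketch of what that looks like in practice (the package name and repo layout here are hypothetical, not taken from this thread): with the notebook and the package in the same repo, you turn on autoreload and import directly, with no `sys.path` changes.

# Notebook cell in a repo such as /Workspace/Repos/me@my.domain/repo_name/,
# with the package at repo_name/my_package/ (hypothetical layout).
# The repo root is already on sys.path, so a plain import works.
%load_ext autoreload
%autoreload 2

import my_package as mp

# Edits synced into the repo are picked up on the next cell execution
# thanks to autoreload; no sys.path manipulation is needed.
print(mp.__file__)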

You are correct about that specific part of the documentation. However, no matter whether I sync to a repo or to DBFS, and whether I run the notebook from a repo or from the workspace, the `sys.path.append(base_folder)` approach fails in the same way as soon as some code needs to run on the Spark workers.
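One quick way to check this is to inspect `sys.path` inside a task that actually runs on the executors (a minimal sketch, reusing the same repo path as above); the directory appended on the driver does not normally show up there:

import sys

repo_base = "/Workspace/Repos/me@my.domain/"
sys.path.append(repo_base)  # only affects the driver process

# Collect sys.path as seen by an executor task; on a typical cluster
# the appended directory is missing from this list.
worker_paths = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: list(sys.path))
    .collect()[0]
)
print(repo_base in worker_paths)  # usually False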

-werners-
Esteemed Contributor III

I think that is the thing: for repos you do not have to set the `sys.path`; that is how I interpret the help.

Well, even if it is unnecessary, adding further paths should never be an issue...


-werners-
Esteemed Contributor III

Have you checked the dbx issues and discussions?

https://github.com/databrickslabs/dbx/

There are quite a few issues and questions about sync.

Kaniz
Community Manager

Hi @Davide Cagnoni, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if his suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Scott_B
New Contributor III

Hi @Davide Cagnoni. Please see my answer to this post: https://community.databricks.com/s/question/0D53f00001mUyh2CAC/limitations-with-udfs-wrapping-module...

I will copy it here for you:

If your notebook is in the same Repo as the module, this should work without any modifications to `sys.path`.

If your notebook is not in the same Repo as the module, you may need to ensure that `sys.path` is correct on all nodes in your cluster that need the module. For example, this code should work for you:

# Create a wrapper function around my module that updates the sys path
import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

def my_wrapper_function(x):
    # Runs on the worker, so the path is updated where the module is needed
    sys.path.append("/Workspace/Repos/user_name/repo_name")
    from repo_name import lib_function
    return lib_function(x)

# Define the UDF
my_udf = udf(lambda col_name: my_wrapper_function(col_name))

# This should work now (withColumn takes the target column name as a string)
df = df.withColumn("col1", my_udf(F.col("col1")))
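Note that the `sys.path.append` sits inside the wrapper function rather than at the top of the notebook: the function body is what gets serialized and executed in the worker processes, so the path is added in the very process where the `from repo_name import lib_function` has to succeed.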
