How to use python packages from `sys.path` ( in so...

DavideCagnoni · 09-27-2022

The help of `dbx sync` states that

```for the imports to work you need to update the Python path to include this target directory you're syncing to```

This works well when the package contains only driver-level functions. However, I ran into an issue when my `edit-mode` package contained a scikit-learn model.

In particular, the package contained a few dataframe-processing functions and also a classification model training function.

My notebook extracted some data, processed it into a training dataset, trained a classification model and logged it to MLflow, and then retrieved the model to apply it to a simulation dataset (as a pyfunc). However, as soon as the model was applied, an error was raised:

`Python: ModuleNotFoundError: No module named 'my_package'`

I think this happens because the Spark workers do not have the correct `sys.path` set. Is it possible to force them to look in the desired path?
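
The suspected mechanism can be reproduced without Spark. The following is only a minimal sketch under stated assumptions: a fresh Python subprocess stands in for a Spark worker, and the temporary `my_package` module, its `double` function, and the `run_child` helper are hypothetical names made up for the illustration. It shows that an object pickled by reference on the "driver" can only be unpickled where the defining module is importable.

```python
import os
import pickle
import subprocess
import sys
import tempfile
from typing import Optional

# Code run by the stand-in "worker": a fresh interpreter that never saw
# the parent's sys.path.append call.
CHILD_CODE = """
import pickle, sys
if sys.argv[1]:
    sys.path.append(sys.argv[1])
func = pickle.loads(sys.stdin.buffer.read())
print(func(2))
"""


def run_child(payload: bytes, extra_path: Optional[str] = None) -> subprocess.CompletedProcess:
    """Unpickle `payload` in a new interpreter, optionally extending its sys.path."""
    return subprocess.run(
        [sys.executable, "-c", CHILD_CODE, extra_path or ""],
        input=payload,
        capture_output=True,
    )


with tempfile.TemporaryDirectory() as pkg_dir:
    # Hypothetical one-function module, only importable via sys.path.
    with open(os.path.join(pkg_dir, "my_package.py"), "w") as fh:
        fh.write("def double(x):\n    return 2 * x\n")

    sys.path.append(pkg_dir)  # affects this ("driver") process only
    import my_package

    # Functions are pickled by reference: module name plus attribute name.
    payload = pickle.dumps(my_package.double)

    without_path = run_child(payload)                   # child lacks the path
    with_path = run_child(payload, extra_path=pkg_dir)  # child gets the path

print(without_path.stderr.decode().strip().splitlines()[-1])
print(with_path.stdout.decode().strip())
```

The first child fails with the same `ModuleNotFoundError` as above, while the second one succeeds. This suggests the fix has to make the package importable in the worker processes themselves, not only on the driver.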

A mock of my notebook follows:

repo_base = "/Workspace/Repos/me@my.domain/"
 
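import sys

# Make the synced repo importable on the driver
sys.path.append(repo_base)

import mlflow
import my_package as mp
from pyspark.sql import functions as F  # needed for F.lit below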
 
train, simulation = mp.split_train_and_simulation_dataset(
    full_dataset=spark.table("mydb.mydataset")
)
 
classification_model = mp.train_classifier(
    train
)
 
with mlflow.start_run() as classifier_training_run:
    mlflow.sklearn.log_model(classification_model, "model")
 
 
logged_model_uri = f"runs:/{classifier_training_run.info.run_id}/model"
 
loaded_model = mlflow.pyfunc.spark_udf(
    spark, model_uri=logged_model_uri, result_type="string"
)
 
simulation_with_prediction = simulation.withColumn(
    "predictions", loaded_model("feature_column") == F.lit("True")
)
 
display(simulation_with_prediction)
# this last command fails

How can I use Python packages from `sys.path` (in some sort of "edit mode") in a way that works on the workers too?
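
One direction that might be worth trying, not verified against this setup, is to ship the package to the workers as a zip archive instead of relying on `sys.path`. The sketch below only covers the part that runs anywhere: building the archive and checking that a fresh interpreter can import from it. The `zip_package` helper, the temporary `repo` folder, and the `double` function are hypothetical names for illustration.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def zip_package(parent_dir: str, package_name: str, out_dir: str) -> str:
    """Zip parent_dir/package_name so the package folder sits at the archive root."""
    archive_base = os.path.join(out_dir, package_name)
    return shutil.make_archive(
        archive_base, "zip", root_dir=parent_dir, base_dir=package_name
    )


# Build a tiny hypothetical package to stand in for the synced repo content.
work_dir = tempfile.mkdtemp()
repo_dir = os.path.join(work_dir, "repo")
os.makedirs(os.path.join(repo_dir, "my_package"))
with open(os.path.join(repo_dir, "my_package", "__init__.py"), "w") as fh:
    fh.write("def double(x):\n    return 2 * x\n")

zip_path = zip_package(repo_dir, "my_package", out_dir=work_dir)

# A fresh interpreter (stand-in for a worker) imports straight from the zip.
check = subprocess.run(
    [
        sys.executable,
        "-c",
        "import sys; sys.path.insert(0, sys.argv[1]); "
        "import my_package; print(my_package.double(21))",
        zip_path,
    ],
    capture_output=True,
    text=True,
)
print(check.stdout.strip())

shutil.rmtree(work_dir)  # clean up the temporary files
```

On a cluster, the archive would then be registered with `spark.sparkContext.addPyFile(zip_path)` before the UDF is applied. Whether that is sufficient for a model loaded through `mlflow.pyfunc.spark_udf` on Databricks is an assumption here, and the archive would have to be rebuilt after each sync.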