09-27-2022 02:56 AM
The help of `dbx sync` states that
```for the imports to work you need to update the Python path to include this target directory you're syncing to```
This works quite well when the package contains only driver-level functions. However, I ran into an issue when my `edit-mode` package contained a scikit-learn model.
In particular, the package contained a few dataframe-processing functions and also a classification model training function.
My notebook extracted some data, processed it into a training dataset, trained a classification model and logged it to MLflow, and then retrieved the model to apply it to a simulation dataset (as a pyfunc). However, as soon as the model was applied, an error was raised:
`Python: ModuleNotFoundError: No module named 'my_package'`
I think this is because the Spark workers do not have the correct `sys.path` set: the `sys.path.append` call in the notebook only affects the driver's Python process. Is it possible to force the workers to look in the desired path?
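To illustrate what I suspect is happening, here is a minimal sketch that runs outside Databricks. It rests on assumptions: `toy_package` and `ToyModel` are hypothetical stand-ins for `my_package` and the trained classifier, plain `pickle` stands in for whatever serializer MLflow uses, and a fresh Python subprocess stands in for a Spark worker. As far as I understand, a pickled object stores only a reference to the module that defines its class, so the process that loads it must be able to import that module.

```python
import os
import pickle
import subprocess
import sys
import tempfile
import textwrap


def run_in_fresh_interpreter(code, extra_path=None):
    """Run `code` in a new Python process (a stand-in for a worker)."""
    env = dict(os.environ)
    if extra_path is not None:
        # Prepend the extra directory to whatever PYTHONPATH already exists.
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = (
            extra_path if not existing else os.pathsep.join([extra_path, existing])
        )
    return subprocess.run(
        [sys.executable, "-c", code], env=env, capture_output=True, text=True
    )


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical "repo" directory containing a one-file package.
        pkg_dir = os.path.join(workdir, "repo")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, "toy_package.py"), "w") as f:
            f.write(textwrap.dedent("""
                class ToyModel:
                    def predict(self, x):
                        return x > 0
            """))

        # "Driver" side: extend sys.path, import the package, pickle a model.
        sys.path.append(pkg_dir)
        try:
            import toy_package
            model_file = os.path.join(workdir, "model.pkl")
            with open(model_file, "wb") as f:
                pickle.dump(toy_package.ToyModel(), f)
        finally:
            sys.path.remove(pkg_dir)
            sys.modules.pop("toy_package", None)

        # "Worker" side: a new process that only loads the pickled model.
        load_code = (
            "import pickle; "
            f"m = pickle.load(open({model_file!r}, 'rb')); "
            "print(m.predict(1))"
        )
        without_path = run_in_fresh_interpreter(load_code)
        with_path = run_in_fresh_interpreter(load_code, extra_path=pkg_dir)
        return without_path, with_path


if __name__ == "__main__":
    without_path, with_path = demo()
    print(without_path.stderr.strip().splitlines()[-1])
    print(with_path.stdout.strip())
```

In this sketch the first load fails with a `ModuleNotFoundError` and the second succeeds once the directory is on the new process's path, which matches the behaviour I see on the cluster. I have not confirmed that Spark workers behave the same way.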
A mock of my notebook follows:
repo_base = "/Workspace/Repos/me@my.domain/"
import sys, os
sys.path.append(repo_base)  # only affects the driver's Python process
import mlflow
from pyspark.sql import functions as F  # needed for F.lit below
import my_package as mp
train, simulation = mp.split_train_and_simulation_dataset(
    full_dataset=spark.table("mydb.mydataset")
)
classification_model = mp.train_classifier(
    train
)
with mlflow.start_run() as classifier_training_run:
    mlflow.sklearn.log_model(classification_model, "model")
logged_model_uri = f"runs:/{classifier_training_run.info.run_id}/model"
loaded_model = mlflow.pyfunc.spark_udf(
    spark, model_uri=logged_model_uri, result_type="string"
)
simulation_with_prediction = simulation.withColumn(
    "predictions", loaded_model("feature_column") == F.lit("True")
)
display(simulation_with_prediction)
# this last command fails