09-27-2022 02:56 AM
The help of `dbx sync` states that "for the imports to work you need to update the Python path to include this target directory you're syncing to".
This works quite well as long as the package only contains driver-level functions. However, I ran into an issue when my `edit-mode` package contained a scikit-learn model.
In particular, the package contained a few dataframe-processing functions and a classification model training function.
My notebook extracted some data, processed it into a training dataset, trained and logged a classification model to MLflow, and then retrieved it (as a pyfunc) to apply it to a simulation dataset. However, as soon as the model was applied, an error was raised:
`Python: ModuleNotFoundError: No module named 'my_package'`
I think this is due to the Spark workers not having the correct `sys.path` set. Is it possible to force them to look into the desired path?
A mock of my notebook follows:
```python
repo_base = "/Workspace/Repos/me@my.domain/"

import sys, os
sys.path.append(repo_base)

import mlflow
from pyspark.sql import functions as F

import my_package as mp

# Split the source table into a training and a simulation dataset
train, simulation = mp.split_train_and_simulation_dataset(
    full_dataset=spark.table("mydb.mydataset")
)

# Train the scikit-learn classifier on the driver
classification_model = mp.train_classifier(train)

# Log the trained model to MLflow
with mlflow.start_run() as classifier_training_run:
    mlflow.sklearn.log_model(classification_model, "model")

# Load the logged model back as a pyfunc Spark UDF
logged_model_uri = f"runs:/{classifier_training_run.info.run_id}/model"
loaded_model = mlflow.pyfunc.spark_udf(
    spark, model_uri=logged_model_uri, result_type="string"
)

# Apply the model to the simulation dataset
simulation_with_prediction = simulation.withColumn(
    "predictions", loaded_model("feature_column") == F.lit("True")
)
# this last command fails with the ModuleNotFoundError on the workers
display(simulation_with_prediction)
```
09-29-2022 01:15 AM
If I read it correctly, the part of the help you mention is about syncing to DBFS, whereas you use Repos.
For repos:
When executing notebooks in a repo, the root of the repo is automatically added to the Python path, so imports work relative to the repo root. This means that, aside from turning on autoreload, you don't need to do anything else special for your changes to be reflected in the cell's execution.
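For reference, the autoreload part is just the standard IPython magics, so a minimal sketch of a notebook cell (nothing Databricks-specific assumed) would be:
```python
# Reload imported modules before each cell execution, so edits synced
# into the repo are picked up without detaching/reattaching the cluster.
%load_ext autoreload
%autoreload 2
```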
09-29-2022 01:28 AM
You are correct about the specific part of the documentation. However, no matter whether I sync to a repo or to DBFS, and whether I run the notebook from a repo or from the workspace, `sys.path.append(base_folder)` fails in the same way as soon as some code needs to run on the Spark workers.
09-29-2022 01:31 AM
I think that is the thing: for repos you do not have to set `sys.path`; that is how I interpret the help.
09-29-2022 01:41 AM
Well, even if it is unnecessary, adding further paths should never be an issue...
09-29-2022 01:40 AM
One workaround I found is to replace the `sys.path.append` with a `%pip` magic command:
```
%pip install -e /dbfs/Workspace/Repos/me@my.domain/my_package/
```
but this has the drawback of needing a `setup.py` file for the package to be installable.
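For completeness, the `setup.py` this workaround needs can stay minimal. A sketch along these lines should be enough (the name and version below are placeholders for illustration):
```python
# Minimal setup.py at the root of the synced folder, only so that
# `pip install -e` has a package definition to install.
from setuptools import setup, find_packages

setup(
    name="my_package",         # placeholder package name
    version="0.0.1",           # placeholder version
    packages=find_packages(),  # picks up my_package and its subpackages
)
```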
09-29-2022 01:48 AM
Have you checked the dbx issues and discussions?
https://github.com/databrickslabs/dbx/
There are quite a few issues and questions about sync there.
10-25-2022 09:13 AM
Hi @Davide Cagnoni. Please see my answer to this post: https://community.databricks.com/s/question/0D53f00001mUyh2CAC/limitations-with-udfs-wrapping-module...
I will copy it here for you:
If your notebook is in the same Repo as the module, this should work without any modifications to the sys path.
If your notebook is not in the same Repo as the module, you may need to ensure that the sys path is correct on all nodes in your cluster that need the module. For example, this code should work for you:
```python
# Create a wrapper function around my module that updates the sys path.
# The append and the import run inside the function body, i.e. on the
# executor that evaluates the UDF, which is why the module is found there.
import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

def my_wrapper_function(x):
    sys.path.append("/Workspace/Repos/user_name/repo_name")
    from repo_name import lib_function
    return lib_function(x)

# Define the UDF
my_udf = udf(lambda col_name: my_wrapper_function(col_name))

# This should work now (withColumn takes the column name as a string)
df = df.withColumn("col1", my_udf(F.col("col1")))
```