How to use Python packages from `sys.path` (in some sort of "edit-mode") so that they work on workers too?

DavideCagnoni
Contributor

The help of `dbx sync` states that

```for the imports to work you need to update the Python path to include this target directory you're syncing to```

This works quite well whenever the package contains only driver-level functions. However, I ran into an issue when my `edit-mode` package contained a scikit-learn model.

In particular, the package contained a few dataframe-processing functions and also a classification model training function.

My notebook extracted some data, processed it into a training dataset, trained a classification model, logged it to MLflow, and then retrieved it (as a pyfunc) to apply it to a simulation dataset. However, as soon as the model needed to be applied, an error was raised:

`Python: ModuleNotFoundError: No module named 'my_package'`

I think this is because the Spark workers do not have the correct `sys.path` set. Is it possible to force them to look in the wanted path?

A mock of my notebook follows:

repo_base = "/Workspace/Repos/me@my.domain/"

import sys

sys.path.append(repo_base)

import mlflow
from pyspark.sql import functions as F  # needed for F.lit below

import my_package as mp

# Build the training and simulation datasets with the package's helpers
train, simulation = mp.split_train_and_simulation_dataset(
    full_dataset=spark.table("mydb.mydataset")
)

classification_model = mp.train_classifier(train)

# Log the trained scikit-learn model to MLflow
with mlflow.start_run() as classifier_training_run:
    mlflow.sklearn.log_model(classification_model, "model")

logged_model_uri = f"runs:/{classifier_training_run.info.run_id}/model"

# Load the model back as a pyfunc Spark UDF and apply it on the workers
loaded_model = mlflow.pyfunc.spark_udf(
    spark, model_uri=logged_model_uri, result_type="string"
)

simulation_with_prediction = simulation.withColumn(
    "predictions", loaded_model("feature_column") == F.lit("True")
)

display(simulation_with_prediction)  # this last command fails

1 ACCEPTED SOLUTION


DavideCagnoni
Contributor

One workaround I found to work is to replace the `sys.path.append` with a `%pip` magic command:

%pip install -e /dbfs/Workspace/Repos/me@my.domain/my_package/

but this has the drawback of requiring a `setup.py` file to work.
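For reference, a minimal `setup.py` for such an editable install might look roughly like this (the project layout and names below are assumptions, not taken from this thread, so adjust them to how the package is actually organised):

# setup.py at the root of the editable-installed project,
# with the importable package in a my_package/ subdirectory
# (hypothetical layout; adapt to your repo)
from setuptools import setup, find_packages

setup(
    name="my_package",
    version="0.0.1",
    packages=find_packages(),
)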


8 REPLIES

-werners-
Esteemed Contributor III

If I read it correctly, the part of the help you mention is about syncing to DBFS, whereas you use Repos.

For repos:

When executing notebooks in a repo, the root of the repo is automatically added to the Python path, so imports work relative to the repo root. This means that, aside from turning on autoreload, you don't need to do anything else special for the changes to be reflected in the cell's execution.
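A minimal sketch of what that looks like in practice (the package name and repo layout here are hypothetical, not taken from this thread): with the notebook and the package in the same repo, you turn on autoreload and import directly, with no `sys.path` changes.

# Notebook cell in a repo such as /Workspace/Repos/me@my.domain/repo_name/,
# with the package at repo_name/my_package/ (hypothetical layout).
# The repo root is already on sys.path, so a plain import works.
%load_ext autoreload
%autoreload 2

import my_package as mp

# Edits synced into the repo are picked up on the next cell execution
# thanks to autoreload; no sys.path manipulation is needed.
print(mp.__file__)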

You are correct about that specific part of the documentation. However, no matter whether I sync to a repo or to DBFS, and whether I run the notebook from a repo or from the workspace, the `sys.path.append(base_folder)` approach fails in the same way as soon as some code needs to run on the Spark workers.
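One quick way to check this is to inspect `sys.path` inside a task that actually runs on the executors (a minimal sketch, reusing the same repo path as above); the directory appended on the driver does not normally show up there:

import sys

repo_base = "/Workspace/Repos/me@my.domain/"
sys.path.append(repo_base)  # only affects the driver process

# Collect sys.path as seen by an executor task; on a typical cluster
# the appended directory is missing from this list.
worker_paths = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: list(sys.path))
    .collect()[0]
)
print(repo_base in worker_paths)  # usually False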

-werners-
Esteemed Contributor III

I think that is the thing: for repos you do not have to set the `sys.path`; that is how I interpret the help.

Well, even if it is unnecessary, adding further paths should never be an issue...


-werners-
Esteemed Contributor III

Have you checked the dbx issues and discussions?

https://github.com/databrickslabs/dbx/

There are quite a few issues and questions about sync.

Kaniz
Community Manager

Hi @Davide Cagnoni, we haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if his suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Scott_B
New Contributor III

Hi @Davide Cagnoni. Please see my answer to this post: https://community.databricks.com/s/question/0D53f00001mUyh2CAC/limitations-with-udfs-wrapping-module...

I will copy it here for you:

If your notebook is in the same Repo as the module, this should work without any modifications to `sys.path`.

If your notebook is not in the same Repo as the module, you may need to ensure that `sys.path` is correct on all nodes in your cluster that need the module. For example, this code should work for you:

# Create a wrapper function around my module that updates the sys path
import sys
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

def my_wrapper_function(x):
    # Runs on the worker, so the path is updated where the module is needed
    sys.path.append("/Workspace/Repos/user_name/repo_name")
    from repo_name import lib_function
    return lib_function(x)

# Define the UDF
my_udf = udf(lambda col_name: my_wrapper_function(col_name))

# This should work now (withColumn takes the target column name as a string)
df = df.withColumn("col1", my_udf(F.col("col1")))
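Note that the `sys.path.append` sits inside the wrapper function rather than at the top of the notebook: the function body is what gets serialized and executed in the worker processes, so the path is added in the very process where the `from repo_name import lib_function` has to succeed.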
