Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

ModuleNotFoundError: No module named 'databricks.sdk' in module installed via Pip

alex_crow
New Contributor II

Hello. I'm currently having an issue that I cannot understand or find an adequate workaround for. Recently, my team migrated our Python code from Databricks notebooks into regular Python modules. We build these modules into wheel files, upload them to our organization's Artifactory instance, and install the wheel via a pip command in a common notebook that most of our downstream data transformation notebooks call with %run. Most of the modules in this wheel import the SparkSession and DBUtils objects from the Databricks SDK first thing, using the following import statement:

from databricks.sdk.runtime import spark, dbutils

It should be noted that some of our modules have dependencies on other modules within the same directory.
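For reference, a module in this wheel typically looks something like the simplified sketch below (the function body and table names are illustrative, not our real logic; the import on the first line matches the one in the traceback further down):

# sc_data_estate_lib/table.py (simplified illustration)
from databricks.sdk.runtime import spark, dbutils

def run_scd2_merge(source_view: str, target_table: str) -> None:
    """Executes our SCD type-2 merge through spark.sql()."""
    spark.sql(f"""
        MERGE INTO {target_table} AS tgt
        USING {source_view} AS src
        ON tgt.business_key = src.business_key AND tgt.is_current = TRUE
        WHEN MATCHED AND tgt.row_hash <> src.row_hash
          THEN UPDATE SET is_current = FALSE, valid_to = current_timestamp()
        WHEN NOT MATCHED
          THEN INSERT *
    """)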

This was working yesterday during my various iterations of migrating code to Python modules, building them into the wheel, uploading to Artifactory, etc. Today, upon logging on, when attempting to run a particular cell within one of our transformation notebooks that I've been using for testing, I'm greeted with the following error:

Notebook exited: PythonException:

  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 192, in _read_with_length
    return self.loads(obj)
           ^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/serializers.py", line 572, in loads
    return cloudpickle.loads(obj, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/sc_data_estate_lib/table.py", line 1, in <module>
    from databricks.sdk.runtime import spark, dbutils
ModuleNotFoundError: No module named 'databricks.sdk'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1964, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/worker.py", line 1851, in read_udfs
    read_single_udf(
  File "/databricks/spark/python/pyspark/worker.py", line 802, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/worker_util.py", line 70, in read_command
    command = serializer._read_with_length(file)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/serializers.py", line 196, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 192, in _read_with_length
    return self.loads(obj)
           ^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/serializers.py", line 572, in loads
    return cloudpickle.loads(obj, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.11/site-packages/sc_data_estate_lib/table.py", line 1, in <module>
    from databricks.sdk.runtime import spark, dbutils
ModuleNotFoundError: No module named 'databricks.sdk'

We're currently running Databricks Runtime 15.4 LTS. It should be noted that the function producing this error in turn calls spark.sql(), which executes our SCD type-2 logic.

I've tried myriad combinations of options to try to regain functionality, to no avail. I'm able to import the databricks.sdk.runtime package just fine from the aforementioned testing notebook, and using pip show databricks-sdk I can verify that version 0.20.0 of the package is installed. I've also tried upgrading to the latest available version (0.36.0) with pip install --upgrade databricks-sdk, again to no avail. Perhaps the most frustrating part of all this is that it worked yesterday, but no longer does.

If anyone can point me in the right direction, I'd greatly appreciate it. I've been wrestling with this for several days now, and would love to get things up-and-running again. Thank you.

7 REPLIES

alex_crow
New Contributor II

Maybe I should also mention that when doing pip install --upgrade databricks-sdk, not only is the version increased from 0.20.0 to 0.36.0, but the location of the package changes from /databricks/python3/lib/python3.11/site-packages to /local_disk0/.ephemeral_nfs/envs/pythonEnv-<guid>/lib/python3.11/site-packages. Not sure if this is significant or not.

cas001
New Contributor II

I'm hitting a similar issue, and I suspect it's related to Spark UDFs. In my case, I load a sklearn model and use a Spark UDF to speed up model prediction, and it raises a similar error. However, if I don't use a UDF, it works.

  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1964, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/worker.py", line 1851, in read_udfs
    read_single_udf(
  File "/databricks/spark/python/pyspark/worker.py", line 802, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/worker_util.py", line 72, in read_command
    command = serializer.loads(command.value)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/serializers.py", line 572, in loads
    return cloudpickle.loads(obj, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ModuleNotFoundError: No module named 'utils'
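
For context, the pattern that triggers this for me is roughly the sketch below (model and module names are illustrative); the worker can't import the local utils module referenced by the pickled UDF. One common workaround, also sketched here, is to ship that module to the executors with sc.addPyFile (or install it as a cluster library) before defining the UDF:

import pandas as pd
import numpy as np
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Make the local helper importable on the workers too (path is illustrative)
spark.sparkContext.addPyFile("/Workspace/Users/me/utils.py")

import utils                      # local helper that loads/wraps the sklearn model
model = utils.load_model()        # a fitted sklearn estimator, captured in the UDF's closure

@pandas_udf(DoubleType())
def predict_udf(features: pd.Series) -> pd.Series:
    # Runs on the workers; everything referenced here must be importable there.
    X = np.array(features.tolist())   # each element is an array of feature values
    return pd.Series(model.predict(X))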

 

Volmer
New Contributor II

Hi Alex,

 

I am now facing a similar problem. Did you ever find a solution to this? 

BR

kabir
New Contributor II

Hi Team,

I'm also facing the same issue.

I followed the steps below:

1. load_data.py

%python
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("LoadData").getOrCreate()

def load_data_to_target(df):
    """Inserts the given DataFrame into target_test_table."""
    df.write.format("delta").mode("append").saveAsTable("workspace.default.target_test_table")
    print("Data successfully inserted into target_test_table")

 

2. extract_data.py 

%python
from pyspark.sql import SparkSession
from load_data import load_data_to_target

# Initialize Spark Session
spark = SparkSession.builder.appName("ExtractData").getOrCreate()

def extract_data():
    """Extracts data from test_table and returns a DataFrame."""
    df = spark.sql("SELECT * FROM workspace.default.test_table")
    return df

# Extract the data
df_test_table = extract_data()

# Call load_data_to_target() and pass the DataFrame
load_data_to_target(df_test_table)

print("Data successfully transferred from test_table to target_test_table")

When running extract_data.py, I get the error below:

ModuleNotFoundError: No module named 'load_data'
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File <command-5830530348176606>, line 2
      1 from pyspark.sql import SparkSession
----> 2 from load_data import load_data_to_target
      4 # Initialize Spark Session
      5 spark = SparkSession.builder.appName("ExtractData").getOrCreate()

File /databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py:81, in AutoreloadDiscoverabilityHook.pre_run_cell.<locals>.patched_import(name, *args, **kwargs)
     75 if not self._should_hint and (
     76     (module := sys.modules.get(absolute_name)) is not None and
     77     (fname := get_allowed_file_name_or_none(module)) is not None and
     78     (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
     79         absolute_name, float("inf")) and not self._should_hint):
     80     self._should_hint = True
---> 81 module = self._original_builtins_import(name, *args, **kwargs)
     82 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
     83     mtime = mtime or os.stat(fname).st_mtime

ModuleNotFoundError: No module named 'load_data'
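
From what I can tell, the import fails because a %python cell saved this way is a notebook, not an importable Python module, so from load_data import ... has nothing to resolve. A minimal sketch of a possible workaround (assuming load_data and extract_data are notebooks in the same folder) is to pull the definitions in with %run instead of a Python import, in a cell of extract_data containing only:

%run ./load_data

Then, in a later cell of extract_data, load_data_to_target is already defined in the notebook's namespace:

df_test_table = spark.sql("SELECT * FROM workspace.default.test_table")
load_data_to_target(df_test_table)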

ferdinand
New Contributor II

Did anyone make any progress here? 
I seem to have the same issue. 

It works in an interactive shell, but doesn't work in my code. 

  File "/home/ubuntu/change-detection-inference/liveeo/flows/change_detection_inference/flow.py", line 8, in <module>
    from liveeo.flows.change_detection_inference.tasks import (
  File "/home/ubuntu/change-detection-inference/liveeo/flows/change_detection_inference/tasks.py", line 15, in <module>
    import mlflow
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/__init__.py", line 42, in <module>
    from mlflow import (
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/artifacts/__init__.py", line 12, in <module>
    from mlflow.tracking import _get_store
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/tracking/__init__.py", line 8, in <module>
    from mlflow.tracking._model_registry.utils import (
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/tracking/_model_registry/utils.py", line 4, in <module>
    from mlflow.store.db.db_types import DATABASE_ENGINES
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/store/__init__.py", line 1, in <module>
    from mlflow.store import _unity_catalog  # noqa: F401
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/store/_unity_catalog/__init__.py", line 1, in <module>
    from mlflow.store._unity_catalog import registry  # noqa: F401
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/__init__.py", line 1, in <module>
    from mlflow.store._unity_catalog.registry import rest_store, uc_oss_rest_store  # noqa: F401
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/store/_unity_catalog/registry/rest_store.py", line 70, in <module>
    from mlflow.store.artifact.databricks_sdk_models_artifact_repo import (
  File "/home/ubuntu/.cache/pypoetry/virtualenvs/change-detection-inference-V1cXAqSc-py3.10/lib/python3.10/site-packages/mlflow/store/artifact/databricks_sdk_models_artifact_repo.py", line 4, in <module>
    from databricks.sdk.errors.platform import NotFound
ModuleNotFoundError: No module named 'databricks.sdk'; 'databricks' is not a package

It seems to happen on importing mlflow. 

For reference, I have databricks-sdk version 0.48.0 installed, and mlflow 2.17.2 (also tried 2.21)

In a shell I can import mlflow and even run 

from databricks.sdk.errors.platform import NotFound

without issue

alex_crow
New Contributor II

Hello again everyone, and sorry for the late response. It took a while to understand, but the cause of my issue was the attempt to create (or "promote") Spark UDFs out of functions that depended on classes or objects from the databricks.sdk.runtime package; when such a UDF is deserialized on the workers, that import fails there.

The issue has been resolved in our solution for quite a while now. If I remember correctly, we had to make those functions (the ones being promoted to UDFs) local to the "common" notebook I mentioned in the original post: rather than importing spark and dbutils from the runtime package, they now use the spark and dbutils objects that are globally available within the notebook session. Hopefully this provides some clarity to the others who are also experiencing this issue.
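
For concreteness, the shape of that change is roughly the sketch below (function names and logic are illustrative, not our actual code):

# Before: the function lived in the wheel (e.g. sc_data_estate_lib/table.py), whose first line is
#   from databricks.sdk.runtime import spark, dbutils
# so pickling it into a UDF dragged that import onto the workers, where databricks.sdk isn't available.

# After: the function is defined locally in the common notebook and relies only on the notebook's
# globally available spark object, so nothing in the UDF's closure imports databricks.sdk.runtime.
from pyspark.sql.types import StringType

def clean_value(v):
    # Pure-Python logic only; no dependency on databricks.sdk.runtime.
    return v.strip().upper() if v is not None else None

spark.udf.register("clean_value", clean_value, StringType())   # callable from our spark.sql() statements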

ferdinand
New Contributor II

Lol, OK, so in my case it was because I had a file called databricks.py, which clashed with the installed databricks package.

Renaming my file to databricks_utils.py solved it.
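
If anyone wants to check for the same kind of shadowing, a quick diagnostic sketch (run it wherever the import fails):

import databricks
# A local databricks.py shows up as a single-file module ("'databricks' is not a package");
# the real SDK lives in a (namespace) package with a __path__ and usually no meaningful __file__.
print(getattr(databricks, "__file__", None))
print(getattr(databricks, "__path__", None))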
