02-01-2024 07:16 AM
Hi community,
I am using a PySpark UDF. The function is imported from a repo (in the Repos section) and registered as a UDF in the notebook. When the transformation runs I get a PythonException. It originates from the databricks/sdk/runtime/__init__.py file at the import from dbruntime import UserNamespaceInitializer, which raises ModuleNotFoundError: No module named 'dbruntime'.
This UDF uses functions imported from other modules in the same repo (and from third-party modules). I'm wondering if there are limitations on doing this?
I can get the transformation to run if I put all of the required code, including the imported functions, into a notebook, but this is undesirable because we have a lot of supporting functions and really want to go down the traditional repo route. It's worth noting that non-UDF imports from the repo do work (I've added the repo to sys.path), and running the transform with a small dataset also works, so I assume the problem is with library availability on the workers.
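For reference, here is a minimal sketch of the setup (the repo path, module, and function names are placeholders rather than my real code):

import sys

# Make the repo importable on the driver (placeholder path)
sys.path.append("/Workspace/Repos/<user>/<repo>")

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from my_package.transforms import my_function  # hypothetical repo module

my_udf = udf(my_function, returnType=StringType())
df = df.withColumn("result", my_udf("input_col"))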
Things I have tried that don't work:
Any tips and advice would be much appreciated!
Tom
02-06-2024 04:51 AM
The notebook/code you create the UDF in, does that also reside in Repos?
AFAIK it is enough to import the module/function and register it as a UDF.
02-06-2024 07:53 AM
Thanks for your reply. The function that forms the UDF is in Repos and the notebook is not. For most tests I am registering the UDF in the repo after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), as well as applying the UDF in a file in the repo instead of the notebook, and I get the same error everywhere.
02-06-2024 09:02 AM
Sorry, there were a few mistakes in my first answer. Here is the corrected version:
Thanks for your reply. The function that forms the UDF is in Repos, and the UDF is registered and called in a notebook which is not. For most tests I am registering the UDF in the notebook after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), as well as applying the UDF in a file in the repo instead of the notebook, and I get the same error everywhere.
02-06-2024 08:52 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


@udf(returnType=StringType())
def udf_1(*cols):
    """This works"""
    def helper_function(code: str) -> str:
        if code == "spam":
            return "foo"
        else:
            return "bar"
    return helper_function("HNA")


@udf(returnType=StringType())
def udf_2(*cols):
    """This causes the error"""
    return _helper_function("spam")


def _helper_function(code: str) -> str:
    if code == "spam":
        return "foo"
    else:
        return "bar"
This is an anonymised version of a test that I have created. What is strange is that the UDF that fails mimics the structure of some of our modules that do work (where helper functions are used).
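One theory that would fit this pattern (hedged, based on how cloudpickle generally behaves): the nested helper in udf_1 is serialized by value together with the UDF, whereas the module-level _helper_function in udf_2 is serialized by reference, so the worker has to re-import the defining module when it unpickles the UDF. Any top-level import in that module, such as a from databricks.sdk.runtime import spark at the top of a repo file (exactly what appears in the traceback posted later in this thread), then executes on the worker, where dbruntime does not exist. Keeping runtime-only imports out of the module top level would be one way to test this (names below are placeholders):

# my_module.py (sketch)
def get_spark():
    # Imported lazily, so the worker never executes this during unpickling
    from databricks.sdk.runtime import spark
    return spark


def _helper_function(code: str) -> str:
    return "foo" if code == "spam" else "bar"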
02-07-2024 04:53 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType


@udf(returnType=IntegerType())
def udf_function(s):
    return your_function(s)
where your_function is the imported function, so you actually create a wrapper.
Also do not forget to register the udf.
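For example (the column and table names are placeholders):

# For DataFrame usage the decorated wrapper is enough:
df = df.withColumn("result", udf_function("input_col"))

# Registering it additionally makes it callable from SQL:
spark.udf.register("udf_function", udf_function)
spark.sql("SELECT udf_function(input_col) AS result FROM my_table")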
03-07-2024 10:50 AM
@Tom_Greenwood did you ever find a solution to this? It looks like I have the same use case as you and I'm hitting the same error.
I believe earlier in the year I was able to run this same code with no errors, but now the UDF can't seem to resolve the databricks imports.
03-11-2024 04:09 AM
No, the wrapper function I showed in the snippet was the only thing that worked, but it wasn't practical, so I've found a workaround that doesn't use a UDF at all.
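In case it helps anyone: logic as simple as the spam/foo/bar example above can be expressed with native column functions, which avoids shipping pickled Python to the workers entirely. A sketch (the column name is a placeholder):

from pyspark.sql import functions as F

df = df.withColumn(
    "result",
    F.when(F.col("code") == "spam", F.lit("foo")).otherwise(F.lit("bar")),
)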
03-14-2024 04:04 AM
I was getting a similar error (full traceback below) and determined that it's related to this issue. Setting the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN, as suggested in that GitHub issue, resolved the problem for me (not a great solution, but workable for now).
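The gist of that workaround as a sketch (the values are placeholders; for the executors to see the variables, they can also be set on the cluster under Advanced options > Spark > Environment variables rather than in the notebook):

import os

# Placeholder values; avoid hard-coding a real token in a notebook
os.environ["DATABRICKS_HOST"] = "https://<workspace-url>"
os.environ["DATABRICKS_TOKEN"] = "<personal-access-token>"

With these set, the SDK's default credential chain can resolve when databricks.sdk.runtime is imported on a worker.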
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 48) (10.139.64.15 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1825, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/databricks/spark/python/pyspark/worker.py", line 1598, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/databricks/spark/python/pyspark/worker.py", line 735, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker_util.py", line 67, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 197, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.