Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

UDF importing from other modules

Tom_Greenwood
New Contributor III

Hi community,

I am using a PySpark UDF. The function is imported from a repo (in the Repos section) and registered as a UDF in the notebook. I am getting a PythonException error when the transformation is run. It comes from the databricks.sdk.runtime.__init__.py file at the import `from dbruntime import UserNamespaceInitializer`, which raises ModuleNotFoundError: No module named 'dbruntime'.


This UDF uses functions imported from other modules in the same repo (and from third-party modules). I'm wondering if there are limitations on doing this?

I can get the transformation to run if I put all of the required code, including the imported functions, into the notebook and run it there, but this is undesirable because we have a lot of supporting functions and really want to go down the traditional repo route. It's worth noting that non-UDF imports from the repo do work (I've added the repo to sys.path), and running the transform on a small dataset also works (so I assume it's a problem with library availability on the workers).
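For context, here is roughly what the setup looks like; the repo path, package, and function names below are placeholders, not the real ones.

import sys

# Make the repo's source importable on the driver (path is hypothetical).
sys.path.append("/Workspace/Repos/<user>/<repo>/src")

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

from my_package.transforms import my_transform  # function defined in the repo

# Wrap the imported function as a UDF and apply it.
my_transform_udf = udf(my_transform, StringType())

df = spark.table("some_table")  # hypothetical source table
df = df.withColumn("result", my_transform_udf("input_col"))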

Things I have tried that don't work:

  • Importing dbruntime in the notebook.
  • Registering all the modules used with spark.sparkContext.addPyFile("filepath") ... although I'm not sure if these would appear in the same namespace for importing in the python file.
  • Using Runtime 13.3 and 14.3.
  • Registering the udf in the file with the udf decorator.
  • Importing dbruntime and databricks.sdk.runtime.* in the Python files.
  • Packaging the module into a wheel and installing it on the cluster (with and without also registering the wheel with spark.sparkContext.addPyFile(<path-to-wheel>)); see the sketch after this list.
  • Using the pyspark.pandas API with no UDF registration (did this first, as the transformation function is written to be used in a pandas df.apply).
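For reference, the addPyFile and wheel attempts above look roughly like this; all paths and names are hypothetical.

# Ship an individual module file to the executors; it becomes importable as a
# top-level module, e.g. `import helpers`.
spark.sparkContext.addPyFile("/Workspace/Repos/<user>/<repo>/src/my_package/helpers.py")

# Or ship the packaged code as a wheel (wheels are zip archives, so addPyFile
# accepts them in practice), which makes the whole package importable,
# e.g. `from my_package import helpers`. Installing the wheel as a cluster
# library is the more common route.
spark.sparkContext.addPyFile("/dbfs/FileStore/wheels/my_package-0.1.0-py3-none-any.whl")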

Any tips and advice would be much appreciated!

Tom

12 REPLIES

-werners-
Esteemed Contributor III

Does the notebook/code you create the UDF in also reside in Repos?
AFAIK it is enough to import the module/function and register it as a UDF.

Thanks for your reply. The function that forms the UDF is in Repos, and the UDF is registered and called in a notebook which is not. For most tests I am registering the UDF in the notebook after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), and running the application of the UDF from a file in the repo instead of the notebook, and I'm getting the same error everywhere.

Tom_Greenwood
New Contributor III
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def udf_1(*cols):
    """This works"""
    def helper_function(code: str) -> str:
        if code == "spam":
            return "foo"
        else:
            return "bar"

    return helper_function("HNA")

@udf(returnType=StringType())
def udf_2(*cols):
    """This causes the error"""
    return _helper_function("spam")

def _helper_function(code: str) -> str:
    if code == "spam":
        return "foo"
    else:
        return "bar"

This is an anonymised version of a test that I created. What is strange is that the UDF that fails mimics the structure of some of our modules that do work (where helper functions are used).

-werners-
Esteemed Contributor III

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def udf_function(s):
    return your_function(s)

where your_function is the imported function, so you actually create a wrapper.
Also do not forget to register the udf.
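For completeness, and assuming udf_function is defined as in the snippet above, registration and use would look roughly like this (table and column names are made up):

# Registering makes the UDF callable from SQL as well; for the DataFrame API,
# referencing udf_function directly is enough.
spark.udf.register("udf_function", udf_function)

df = spark.table("some_table")                          # hypothetical source table
df = df.withColumn("result", udf_function("some_col"))  # DataFrame API usage

df.createOrReplaceTempView("v")
spark.sql("SELECT udf_function(some_col) AS result FROM v")  # SQL usage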

DanC
New Contributor II

@Tom_Greenwood did you ever find a solution to this? It looks like I have the same use case as you and am hitting the same error.

I believe earlier in the year I was able to run this same code with no errors, but now the UDF can't seem to resolve the Databricks imports.

Tom_Greenwood
New Contributor III

No, the wrapper function I showed in the snippet was the only thing that worked, but it wasn't practical, so I've found a workaround that avoids using a UDF at all.

DennisB
New Contributor III

I was getting a similar error (full traceback below) and determined that it's related to this issue. Setting the env variables DATABRICKS_HOST and DATABRICKS_TOKEN as suggested in that GitHub issue resolved the problem for me (it's not a great solution, but it's workable for now).
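For anyone trying the same workaround: the variables need to be visible to the executor Python processes, so the usual place to set them is the cluster configuration rather than the notebook. A rough sketch, with a hypothetical secret scope and workspace URL:

# Set these under Compute > Advanced options > Spark > Environment variables
# (values below are placeholders; the {{secrets/...}} syntax pulls from a secret scope):
#
#   DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
#   DATABRICKS_TOKEN={{secrets/my_scope/my_token}}
#
# Driver-side sanity check that the variables are in place:
import os

assert os.environ.get("DATABRICKS_HOST"), "DATABRICKS_HOST is not set on this cluster"
assert os.environ.get("DATABRICKS_TOKEN"), "DATABRICKS_TOKEN is not set on this cluster"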

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 48) (10.139.64.15 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1825, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/databricks/spark/python/pyspark/worker.py", line 1598, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/databricks/spark/python/pyspark/worker.py", line 735, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker_util.py", line 67, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 197, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

 

I've hit the same problem and isolated it to some degree. I can reproduce it in our main repo (with the Python functions & UDF wrappers installed as part of a package), but cannot reproduce it in a new minimal repo I made for testing. When I copy the package source into a non-repo folder, everything works fine.

Same type of error messages:

 

...
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
...
ModuleNotFoundError: No module named 'dbruntime'
...

 

I don't understand the root cause behind that GitHub issue, or why setting the environment variables helps.

I'm working with databricks support to resolve, and will try to share answers here.

Did Databricks support manage to help? I'm having the same issue, so I would be very grateful if you could share any solutions/tips they gave you.

We eventually got it fixed, but I forgot to post right away. I don't remember if Databricks support helped resolve it or if we figured it out on our own.

The root cause was that one stray (unimported) module in our library used dbutils to load a secret into a global variable (credentials for an external S3 bucket), leftovers from pasting code from a notebook into a Python module. When we refactored to remove the offending lines from the library, all of the important modules started working again for UDFs.
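To illustrate the pattern (the module, scope, and key names here are invented, not from the thread):

# Problematic: a module-level dbutils call runs at import time. When the module is
# imported on a UDF worker, databricks.sdk falls back to RemoteDbUtils, which fails
# with "No module named 'dbruntime'" / "cannot configure default credentials".
#
#   from databricks.sdk.runtime import dbutils
#   S3_CREDS = dbutils.secrets.get("my_scope", "s3_creds")
#
# Safer: defer the lookup into a function so it only runs on the driver, when needed.
def get_s3_creds() -> str:
    from databricks.sdk.runtime import dbutils  # imported lazily, driver-side only
    return dbutils.secrets.get("my_scope", "s3_creds")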

Abdul-Mannan
New Contributor III

I faced this issue when I was running data ingestion on a Unity Catalog table where the cluster access mode was Shared.

I changed it to `Single user` and re-ran it; now it is working.


 
