<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: UDF importing from other modules in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59462#M31416</link>
    <description>&lt;P&gt;the notebook/code you create the udf in, does that also reside in Repos?&lt;BR /&gt;AFAIK it is enough to import the module/function and register it as a UDF.&lt;/P&gt;</description>
    <pubDate>Tue, 06 Feb 2024 12:51:37 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2024-02-06T12:51:37Z</dc:date>
    <item>
      <title>UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/58988#M31308</link>
      <description>&lt;P&gt;Hi community,&lt;/P&gt;&lt;P&gt;I am using a PySpark UDF. The function is imported from a repo (in the Repos section) and registered as a UDF in the notebook. When the transformation runs I get a PythonException. It comes from the databricks.sdk.runtime.__init__.py file at the import "from dbruntime import UserNamespaceInitializer", which fails with ModuleNotFoundError: No module named 'dbruntime'.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Tom_Greenwood_0-1706798998837.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6081i8BE3FFEAFA41EFBA/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Tom_Greenwood_0-1706798998837.png" alt="Tom_Greenwood_0-1706798998837.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;This UDF uses functions imported from other modules in the same repo (and third-party modules). Are there limitations on doing this?&lt;/P&gt;&lt;P&gt;I can get the transformation to run if I put all of the required code, including the imported functions, into a notebook, but this is undesirable: we have a lot of supporting functions and really want to go down the traditional repo route. It's worth noting that non-UDF imports from the repo do work (I've added the repo to sys.path), and running the transform on a small dataset also works, so I assume the problem is library availability on the workers.&lt;/P&gt;&lt;P&gt;Things I have tried that don't work:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Importing dbruntime in the notebook.&lt;/LI&gt;&lt;LI&gt;Registering all the modules used with spark.sparkContext.addPyFile("filepath"), although I'm not sure these would appear in the same namespace for importing in the Python file.&lt;/LI&gt;&lt;LI&gt;Using Runtime 13.3 and 14.3.&lt;/LI&gt;&lt;LI&gt;Registering the udf in the file with the udf decorator.&lt;/LI&gt;&lt;LI&gt;Importing dbruntime and databricks.sdk.runtime.* in the Python files.&lt;/LI&gt;&lt;LI&gt;Packaging the module into a wheel and installing it on the cluster (with and without registering the wheel with spark.sparkContext.addPyFile(&amp;lt;path-to-wheel&amp;gt;)).&lt;/LI&gt;&lt;LI&gt;Using the pyspark.pandas API with no udf registration (tried this first, as the transformation function is written to be used in a pandas df.apply).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Any tips and advice would be much appreciated!&lt;/P&gt;&lt;P&gt;Tom&lt;/P&gt;</description>
      <pubDate>Thu, 01 Feb 2024 15:16:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/58988#M31308</guid>
      <dc:creator>Tom_Greenwood</dc:creator>
      <dc:date>2024-02-01T15:16:54Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59462#M31416</link>
      <description>&lt;P&gt;the notebook/code you create the udf in, does that also reside in Repos?&lt;BR /&gt;AFAIK it is enough to import the module/function and register it as a UDF.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Feb 2024 12:51:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59462#M31416</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-02-06T12:51:37Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59484#M31424</link>
      <description>&lt;P&gt;Thanks for your reply. The function that forms the udf is in Repos and the notebook is not. For most tests I am registering the udf in the repo after importing the function; however, I have also tested registering the udf in the file where it's written (with the udf decorator), and running the application of the udf in a file in the repo instead of the notebook, and I'm still getting the same error everywhere.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Feb 2024 15:53:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59484#M31424</guid>
      <dc:creator>Tom_Greenwood</dc:creator>
      <dc:date>2024-02-06T15:53:46Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59489#M31427</link>
      <description>&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def udf_1(*cols):
    """This works"""
    def helper_function(code: str) -&amp;gt; str:
        if code == "spam":
            return "foo"
        else:
            return "bar"

    return helper_function("HNA")

@udf(returnType=StringType())
def udf_2(*cols):
    """This causes the error"""
    return _helper_function("spam")
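    # Hedged aside (an editor's reading, not confirmed by Databricks docs):
    # udf_1 above works because helper_function is defined inside the UDF and
    # is pickled by value. _helper_function here is a module-level global, so
    # cloudpickle serialises udf_2 with a reference to its defining module;
    # the Spark worker then re-imports that module, and any top-level
    # "from databricks.sdk.runtime import *" that this triggers fails on the
    # worker with ModuleNotFoundError: No module named 'dbruntime'.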

def _helper_function(code: str) -&amp;gt; str:
    if code == "spam":
        return "foo"
    else:
        return "bar"&lt;/LI-CODE&gt;&lt;P&gt;This is a anonymised version of a test that I have created. What is strange is that the udf that fails mimics the structure of some of our module that do work (where helper functions are used).&lt;/P&gt;</description>
      <pubDate>Tue, 06 Feb 2024 16:52:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59489#M31427</guid>
      <dc:creator>Tom_Greenwood</dc:creator>
      <dc:date>2024-02-06T16:52:01Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59492#M31428</link>
      <description>&lt;P&gt;Sorry, a few mistakes were in my first answer. Here is the corrected version:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P&gt;Thanks for your reply. The function that forms the udf is in Repos, and the udf is registered and called in a notebook which is not. For most tests I am registering the udf in the notebook after importing the function; however, I have also tested registering the udf in the file where it's written (with the udf decorator), and running the application of the udf in a file in the repo instead of the notebook, and I'm still getting the same error everywhere.&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 06 Feb 2024 17:02:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59492#M31428</guid>
      <dc:creator>Tom_Greenwood</dc:creator>
      <dc:date>2024-02-06T17:02:21Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59590#M31453</link>
      <description>&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def udf_function(s):
    return your_function(s)&lt;/LI-CODE&gt;&lt;P&gt;where your_function is the imported function, so you actually create a wrapper.&lt;BR /&gt;Also do not forget to register the udf.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Feb 2024 12:53:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/59590#M31453</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-02-07T12:53:51Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/62957#M32140</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99162"&gt;@Tom_Greenwood&lt;/a&gt; did you ever find a solution to this? It looks like I have the same use case as you and am hitting the same error.&lt;/P&gt;&lt;P&gt;I believe earlier in the year I was able to run this same code with no errors, but now the udf can't seem to resolve the databricks imports.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Mar 2024 18:50:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/62957#M32140</guid>
      <dc:creator>DanC</dc:creator>
      <dc:date>2024-03-07T18:50:08Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/63224#M32196</link>
      <description>&lt;P&gt;No, the wrapper function I showed in the snippet was the only thing that worked, but it wasn't practical, so I've found a workaround that avoids using a udf at all.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Mar 2024 11:09:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/63224#M32196</guid>
      <dc:creator>Tom_Greenwood</dc:creator>
      <dc:date>2024-03-11T11:09:48Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/63666#M32311</link>
      <description>&lt;P&gt;I was getting a similar error (full traceback below), and determined that it's related to&amp;nbsp;&lt;A href="https://github.com/databricks/databricks-sdk-py/issues/360" target="_self"&gt;this issue&lt;/A&gt;. Setting the env variables&amp;nbsp;&lt;SPAN&gt;&lt;FONT face="lucida sans unicode,lucida sans"&gt;DATABRICKS_HOST&lt;/FONT&gt; and&amp;nbsp;&lt;FONT face="lucida sans unicode,lucida sans"&gt;DATABRICKS_TOKEN&lt;/FONT&gt; as suggested in that GitHub issue resolved the problem for me (not a great solution, but workable for now).&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 48) (10.139.64.15 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in &amp;lt;module&amp;gt;
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in &amp;lt;module&amp;gt;
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in &amp;lt;module&amp;gt;
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1825, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/databricks/spark/python/pyspark/worker.py", line 1598, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/databricks/spark/python/pyspark/worker.py", line 735, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker_util.py", line 67, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 197, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in &amp;lt;module&amp;gt;
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in &amp;lt;module&amp;gt;
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in &amp;lt;module&amp;gt;
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Mar 2024 11:04:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/63666#M32311</guid>
      <dc:creator>DennisB</dc:creator>
      <dc:date>2024-03-14T11:04:03Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/67632#M33394</link>
      <description>&lt;P&gt;I've hit the same problem and isolated it to some degree. I can reproduce it in our main repo (with the Python functions &amp;amp; UDF wrappers installed as part of a package), but cannot reproduce it in a new minimal repo I made for testing. When I copy the package source into a non-repo folder, everything works fine.&lt;/P&gt;&lt;P&gt;Same type of error messages:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;...
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
...
ModuleNotFoundError: No module named 'dbruntime'
...&lt;/LI-CODE&gt;&lt;P&gt;I don't understand the root cause described in that GitHub issue, or why setting the environment variables may help.&lt;/P&gt;&lt;P&gt;I'm working with Databricks support to resolve it, and will try to share answers here.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Apr 2024 00:14:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/67632#M33394</guid>
      <dc:creator>JosiahJohnston</dc:creator>
      <dc:date>2024-04-30T00:14:37Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/80988#M36187</link>
      <description>&lt;P&gt;Did Databricks support manage to help? I'm having the same issue, so I would be very grateful if you could share any solutions or tips they gave you.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jul 2024 15:10:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/80988#M36187</guid>
      <dc:creator>josh_redmond</dc:creator>
      <dc:date>2024-07-29T15:10:45Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/91041#M38069</link>
      <description>&lt;P&gt;I faced this issue when I was running data ingestion on a Unity Catalog table where the cluster access mode was Shared.&lt;BR /&gt;&lt;BR /&gt;I changed it to Single user and re-ran it; now it is working.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AbdulMannan_0-1726740708842.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11343iCD3FC822512B25D6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="AbdulMannan_0-1726740708842.png" alt="AbdulMannan_0-1726740708842.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2024 10:12:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/91041#M38069</guid>
      <dc:creator>Abdul-Mannan</dc:creator>
      <dc:date>2024-09-19T10:12:22Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/91095#M38081</link>
      <description>&lt;P&gt;We eventually got it fixed, but I forgot to post right away. I don't remember if Databricks support helped resolve it, or if we figured it out on our own.&lt;/P&gt;&lt;P&gt;The root cause was that one stray (unimported) module in our library used dbutils at import time to load a secret into a global variable (credentials for an external S3 bucket): leftovers from pasting code from a notebook into a Python module. When we refactored to remove the offending lines from the library, all of the imported modules started working again for UDFs.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Sep 2024 16:56:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/91095#M38081</guid>
      <dc:creator>JosiahJohnston</dc:creator>
      <dc:date>2024-09-19T16:56:35Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/111511#M43918</link>
      <description>&lt;P&gt;Did you ever make any additional progress on this?&lt;/P&gt;&lt;P&gt;I'm hitting a similar issue attempting to reuse functions across UDFs when used within a DLT pipeline. It works fine outside of the DLT.&lt;/P&gt;&lt;P&gt;I can embed all the code into a single function and use it as a udf, but that limits code reuse.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Mar 2025 20:33:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/111511#M43918</guid>
      <dc:creator>drollason</dc:creator>
      <dc:date>2025-03-01T20:33:40Z</dc:date>
    </item>
    <item>
      <title>Re: UDF importing from other modules</title>
      <link>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/117410#M45491</link>
      <description>&lt;P&gt;I just ran into and solved this issue. My problem was that, in the Python script I loaded as a module, I defined the function I planned to use as a UDF separately from the function I actually called in my script. I believe that because of this, the worker applying the UDF didn't have the part where I import * from my module, which would run the "from databricks.sdk.runtime import *" that Databricks tells you to add to a module you plan to import. Defining the function used for applyInPandas inside the function where I actually call applyInPandas fixed it.&lt;/P&gt;&lt;P&gt;To illustrate the problematic layout:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;def a():
    """Function to be used as a udf"""

def b():
    """Function that I'm actually calling"""
    df.applyInPandas(a)
    return&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 01 May 2025 15:50:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/udf-importing-from-other-modules/m-p/117410#M45491</guid>
      <dc:creator>rich_avery</dc:creator>
      <dc:date>2025-05-01T15:50:21Z</dc:date>
    </item>
  </channel>
</rss>

