02-01-2024 07:16 AM
Hi community,
I am using a PySpark UDF. The function is imported from a repo (in the Repos section) and registered as a UDF in the notebook. When the transformation is run I get a PythonException coming from the databricks.sdk.runtime.__init__.py file at the import: from dbruntime import UserNamespaceInitializer. This raises a ModuleNotFoundError: No module named 'dbruntime'.
This UDF uses functions imported from other modules in the same repo (and third-party modules). I'm wondering if there are limitations on doing this?
I can get the transformation to run if I put all of the required code, including the imported functions, into a notebook, but this is undesirable as we have a lot of supporting functions and really want to go down the traditional repo route. It's worth noting that non-UDF imports from the repo do work (I've added it to the sys path), and running the transform with a small dataset also works (so I assume it's a problem with library availability on the workers).
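For reference, this is roughly what the setup looks like (the paths, module, and column names below are placeholders, not our actual repo):
import sys
sys.path.append("/Workspace/Repos/<user>/<repo>/src")  # make the repo importable

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from my_package.transforms import my_function  # placeholder for the imported function

my_udf = udf(my_function, returnType=StringType())  # register the imported function as a UDF
df = df.withColumn("result", my_udf("input_col"))   # df is an existing DataFrame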
Things I have tried that don't work:
Any tips and advice would be much appreciated!
Tom
02-06-2024 04:51 AM
Does the notebook/code you create the UDF in also reside in Repos?
AFAIK it is enough to import the module/function and register it as a UDF.
02-06-2024 07:53 AM
Thanks for your reply. The function that forms the UDF is in Repos and the notebook is not. For most tests I am registering the UDF in the repo after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), and running the application of the UDF in a file in the repo instead of the notebook, and I'm getting the same error everywhere.
02-06-2024 09:02 AM
Sorry, there were a few mistakes in my first answer. Here is the corrected version:
Thanks for your reply. The function that forms the UDF is in Repos, and the UDF is registered and called in a notebook which is not. For most tests I am registering the UDF in the notebook after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), and running the application of the UDF in a file in the repo instead of the notebook, and I'm getting the same error everywhere.
02-06-2024 08:52 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


@udf(returnType=StringType())
def udf_1(*cols):
    """This works"""
    def helper_function(code: str) -> str:
        if code == "spam":
            return "foo"
        else:
            return "bar"
    return helper_function("HNA")


@udf(returnType=StringType())
def udf_2(*cols):
    """This causes the error"""
    return _helper_function("spam")


def _helper_function(code: str) -> str:
    if code == "spam":
        return "foo"
    else:
        return "bar"
This is an anonymised version of a test that I created. What is strange is that the UDF that fails mimics the structure of some of our modules that do work (where helper functions are used).
02-07-2024 04:53 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def udf_function(s):
    return your_function(s)
where your_function is the imported function, so you actually create a wrapper.
Also do not forget to register the udf.
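For example, applying the wrapper might look like this (sketch only; the DataFrame and column names are placeholders):
spark.udf.register("udf_function", udf_function)  # only needed if you call the UDF from SQL
df = df.withColumn("result", udf_function("input_col"))  # direct DataFrame use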
03-07-2024 10:50 AM
@Tom_Greenwood did you ever find a solution to this? It looks like I have the same use case as you and am hitting the same error.
I believe earlier in the year I was able to run this same code with no errors, but now the UDF can't seem to resolve the databricks imports.
03-11-2024 04:09 AM
No, the wrapper function I showed in the snippet was the only thing that worked, but it wasn't practical, so I've found a workaround that doesn't use a UDF at all.
03-14-2024 04:04 AM
I was getting a similar error (full traceback below), and determined that it's related to this issue. Setting the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN as suggested in that GitHub issue resolved the problem for me (albeit not a great solution, but workable for now; see the note after the traceback for roughly where to set them).
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 48) (10.139.64.15 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
self._header_factory = self._credentials_provider(self)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
self.init_auth()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
return cloudpickle.loads(obj, encoding=encoding)
File "/Workspace/Repos/[REDACTED]", line 7, in <module>
from databricks.sdk.runtime import spark
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
dbutils = RemoteDbUtils()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
self._config = Config() if not config else config
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 1825, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/databricks/spark/python/pyspark/worker.py", line 1598, in read_udfs
arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
File "/databricks/spark/python/pyspark/worker.py", line 735, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/databricks/spark/python/pyspark/worker_util.py", line 67, in read_command
command = serializer._read_with_length(file)
File "/databricks/spark/python/pyspark/serializers.py", line 197, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
self._header_factory = self._credentials_provider(self)
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
self.init_auth()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
return cloudpickle.loads(obj, encoding=encoding)
File "/Workspace/Repos/[REDACTED]", line 7, in <module>
from databricks.sdk.runtime import spark
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
dbutils = RemoteDbUtils()
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
self._config = Config() if not config else config
File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
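In case it helps others, one place to set them so that the Python workers also see them is the cluster configuration (Advanced options > Spark > Environment variables); the values below are placeholders, not my real host or token:
DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com
DATABRICKS_TOKEN=<personal-access-token>
Referencing a secret rather than pasting a raw token is preferable if your workspace supports it.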
04-29-2024 05:14 PM
I've hit the same problem and isolated it to some degree. I can reproduce it in our main repo (with the Python functions & UDF wrappers installed as part of a package), but cannot reproduce it in a new minimal repo I made for testing. When I copy the package source into a non-repo folder, everything works fine.
The same type of error messages:
...
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
...
ModuleNotFoundError: No module named 'dbruntime'
...
I don't understand the RCA of that GitHub issue or why setting the environment variables may help.
I'm working with databricks support to resolve, and will try to share answers here.
07-29-2024 08:10 AM
Did Databricks support manage to help? I'm having the same issue, so I would be very grateful if you could share any solutions/tips they gave you.
3 weeks ago
We eventually got it fixed, but I forgot to post right away. I don't remember whether Databricks support helped resolve it or whether we figured it out on our own.
The root cause was that one random (unimported) module in our library was using dbutils to load a secret into a global variable (credentials for an external S3 bucket), leftovers from pasting code from a notebook into a Python module. When we refactored to remove the offending lines from the library, all of the important modules started working again for UDFs.
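To illustrate (a simplified sketch, not our actual code), the problem pattern was module-level code like this, which runs at import time on the UDF workers, where the notebook runtime objects are not available:
# some_module.py (placeholder name)
from databricks.sdk.runtime import dbutils

# runs at import time, even though nothing in this module is needed by the UDF
S3_CREDENTIALS = dbutils.secrets.get("my-scope", "s3-credentials")
Moving the lookup inside a function (or deleting it, as we did) lets the module be imported on the workers without touching dbutils:
def get_s3_credentials():
    from databricks.sdk.runtime import dbutils  # imported lazily, only when called on the driver
    return dbutils.secrets.get("my-scope", "s3-credentials")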
3 weeks ago
I faced this issue when I was running data ingestion on a Unity Catalog table where the cluster access mode was shared.
I changed it to `Single user` and re-ran it, and now it is working.