02-01-2024 07:16 AM
Hi community,
I am using a PySpark UDF. The function is imported from a repo (in the Repos section) and registered as a UDF in the notebook. When the transformation runs I get a PythonException. It originates from the databricks/sdk/runtime/__init__.py file at the import from dbruntime import UserNamespaceInitializer, which raises ModuleNotFoundError: No module named 'dbruntime'.
This UDF uses functions imported from other modules in the same repo (and from third-party modules). I'm wondering if there are limitations on doing this?
I can get the transformation to run if I put all of the required code, including the imported functions, into a notebook, but this is undesirable because we have a lot of supporting functions and really want to go down the traditional repo route. It's worth noting that non-UDF imports from the repo do work (I've added the repo to sys.path), and running the transform with a small dataset also works, so I assume the problem is with library availability on the workers.
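For reference, here is a minimal sketch of the setup (the repo path, module, and function names are placeholders rather than my real code):

import sys

# Make the repo importable on the driver (placeholder path)
sys.path.append("/Workspace/Repos/<user>/<repo>")

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from my_package.transforms import my_function  # hypothetical repo module

my_udf = udf(my_function, returnType=StringType())
df = df.withColumn("result", my_udf("input_col"))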
Things I have tried that don't work:
Any tips and advice would be much appreciated!
Tom
02-06-2024 04:51 AM
The notebook/code you create the UDF in, does that also reside in Repos?
AFAIK it is enough to import the module/function and register it as a UDF.
02-06-2024 07:53 AM
Thanks for your reply. The function that forms the UDF is in Repos and the notebook is not. For most tests I am registering the UDF in the repo after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), as well as applying the UDF in a file in the repo instead of the notebook, and I get the same error everywhere.
02-06-2024 09:02 AM
Sorry, there were a few mistakes in my first answer. Here is the corrected version:
Thanks for your reply. The function that forms the UDF is in Repos, and the UDF is registered and called in a notebook which is not. For most tests I am registering the UDF in the notebook after importing the function, but I have also tested registering the UDF in the file where it's written (with the udf decorator), as well as applying the UDF in a file in the repo instead of the notebook, and I get the same error everywhere.
02-06-2024 08:52 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


@udf(returnType=StringType())
def udf_1(*cols):
    """This works"""
    def helper_function(code: str) -> str:
        if code == "spam":
            return "foo"
        else:
            return "bar"
    return helper_function("HNA")


@udf(returnType=StringType())
def udf_2(*cols):
    """This causes the error"""
    return _helper_function("spam")


def _helper_function(code: str) -> str:
    if code == "spam":
        return "foo"
    else:
        return "bar"
This is an anonymised version of a test that I have created. What is strange is that the UDF that fails mimics the structure of some of our modules that do work (where helper functions are used).
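One theory that would fit this pattern (hedged, based on how cloudpickle generally behaves): the nested helper in udf_1 is serialized by value together with the UDF, whereas the module-level _helper_function in udf_2 is serialized by reference, so the worker has to re-import the defining module when it unpickles the UDF. Any top-level import in that module, such as a from databricks.sdk.runtime import spark at the top of a repo file (exactly what appears in the traceback posted later in this thread), then executes on the worker, where dbruntime does not exist. Keeping runtime-only imports out of the module top level would be one way to test this (names below are placeholders):

# my_module.py (sketch)
def get_spark():
    # Imported lazily, so the worker never executes this during unpickling
    from databricks.sdk.runtime import spark
    return spark


def _helper_function(code: str) -> str:
    return "foo" if code == "spam" else "bar"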
02-07-2024 04:53 AM
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType


@udf(returnType=IntegerType())
def udf_function(s):
    return your_function(s)
where your_function is the imported function, so you actually create a wrapper.
Also do not forget to register the udf.
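For example (the column and table names are placeholders):

# For DataFrame usage the decorated wrapper is enough:
df = df.withColumn("result", udf_function("input_col"))

# Registering it additionally makes it callable from SQL:
spark.udf.register("udf_function", udf_function)
spark.sql("SELECT udf_function(input_col) AS result FROM my_table")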
03-07-2024 10:50 AM
@Tom_Greenwood did you ever find a solution to this? It looks like I have the same use case as you and I'm hitting the same error.
I believe earlier in the year I was able to run this same code with no errors, but now the UDF can't seem to resolve the databricks imports.
03-11-2024 04:09 AM
No, the wrapper function I showed in the snippet was the only thing that worked, but it wasn't practical, so I've found a workaround that doesn't use a UDF at all.
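In case it helps anyone: logic as simple as the spam/foo/bar example above can be expressed with native column functions, which avoids shipping pickled Python to the workers entirely. A sketch (the column name is a placeholder):

from pyspark.sql import functions as F

df = df.withColumn(
    "result",
    F.when(F.col("code") == "spam", F.lit("foo")).otherwise(F.lit("bar")),
)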
03-14-2024 04:04 AM
I was getting a similar error (full traceback below) and determined that it's related to this issue. Setting the environment variables DATABRICKS_HOST and DATABRICKS_TOKEN, as suggested in that GitHub issue, resolved the problem for me (not a great solution, but workable for now).
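The gist of that workaround as a sketch (the values are placeholders; for the executors to see the variables, they can also be set on the cluster under Advanced options > Spark > Environment variables rather than in the notebook):

import os

# Placeholder values; avoid hard-coding a real token in a notebook
os.environ["DATABRICKS_HOST"] = "https://<workspace-url>"
os.environ["DATABRICKS_TOKEN"] = "<personal-access-token>"

With these set, the SDK's default credential chain can resolve when databricks.sdk.runtime is imported on a worker.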
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 48) (10.139.64.15 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1825, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/databricks/spark/python/pyspark/worker.py", line 1598, in read_udfs
    arg_offsets, f = read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=0)
  File "/databricks/spark/python/pyspark/worker.py", line 735, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker_util.py", line 67, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 197, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
    from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 442, in init_auth
    self._header_factory = self._credentials_provider(self)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/credentials_provider.py", line 626, in __call__
    raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 104, in __init__
    self.init_auth()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 447, in init_auth
    raise ValueError(f'{self._credentials_provider.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 193, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 571, in loads
    return cloudpickle.loads(obj, encoding=encoding)
  File "/Workspace/Repos/[REDACTED]", line 7, in <module>
    from databricks.sdk.runtime import spark
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
    dbutils = RemoteDbUtils()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
    self._config = Config() if not config else config
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-892f3ee3-0955-4f40-8c06-f515eed8c2df/lib/python3.10/site-packages/databricks/sdk/config.py", line 109, in __init__
    raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.