06-08-2025 08:49 PM
I've got a UDF which I call using applyInPandas.
The UDF's job is to distribute API calls.
It uses my custom .py library files that make these calls.
Everything worked until I started using `dbutils.widgets.get` and `dbutils.secrets.get` inside those libraries.
Now it throws a huge stack trace.
So the question is: how do I either reconfigure those libraries or get dbutils working inside the UDF?
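For reference, here is a minimal sketch of the pattern that fails; the library path matches my project, but the fetch_stock function, the sample DataFrame, and the schemas are just placeholders:

# Minimal sketch of the failing pattern (function name, sample data and schemas are placeholders).
#
# /Workspace/Shared/sparky/lib/configuration.py does, at import time:
#     from databricks.sdk.runtime import *                      # brings in dbutils
#     API_TOKEN = dbutils.secrets.get("my-scope", "api-token")
#
# Driver notebook:
import pandas as pd
import lib.graphql.shopify_stock_graphql as api   # transitively imports lib.configuration

def call_api(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group issues its own API calls through the custom library.
    return api.fetch_stock(pdf)                   # placeholder entry point

df = spark.createDataFrame([(1, "shop-a"), (2, "shop-b")], "id long, shop string")

# cloudpickle re-imports the library on each worker, where 'from databricks.sdk.runtime import *'
# tries to construct RemoteDbUtils and cannot find default credentials.
df.groupBy("shop").applyInPandas(call_api, schema="id long, shop string").display()

Here is the full stack trace: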
PythonException: Traceback (most recent call last):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 106.0 failed 4 times, most recent failure: Lost task 0.3 in stage 106.0 (TID 230) (10.139.64.4 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 473, in init_auth
self._header_factory = self._credentials_strategy(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/credentials_provider.py", line 703, in __call__
raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 123, in __init__
self.init_auth()
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 478, in init_auth
raise ValueError(f'{self._credentials_strategy.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 192, in _read_with_length
return self.loads(obj)
^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/serializers.py", line 617, in loads
return cloudpickle.loads(obj, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle.py", line 649, in subimport
__import__(name)
File "/Workspace/Shared/sparky/lib/graphql/shopify_stock_graphql.py", line 2, in <module>
import lib.configuration as conf
File "/Workspace/Shared/sparky/lib/configuration.py", line 1, in <module>
from databricks.sdk.runtime import *
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
dbutils = RemoteDbUtils()
^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
self._config = Config() if not config else config
^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 127, in __init__
raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/worker.py", line 2212, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/worker.py", line 1893, in read_udfs
arg_offsets, f = read_single_udf(
^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/worker.py", line 909, in read_single_udf
f, return_type = read_command(pickleSer, infile)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/worker_util.py", line 71, in read_command
command = serializer._read_with_length(file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/serializers.py", line 196, in _read_with_length
raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/runtime/__init__.py", line 79, in <module>
from dbruntime import UserNamespaceInitializer
ModuleNotFoundError: No module named 'dbruntime'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 473, in init_auth
self._header_factory = self._credentials_strategy(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/credentials_provider.py", line 703, in __call__
raise ValueError(
ValueError: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 123, in __init__
self.init_auth()
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 478, in init_auth
raise ValueError(f'{self._credentials_strategy.auth_type()} auth: {e}') from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 192, in _read_with_length
return self.loads(obj)
^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/serializers.py", line 617, in loads
return cloudpickle.loads(obj, encoding=encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle.py", line 649, in subimport
__import__(name)
File "/Workspace/Shared/sparky/lib/graphql/shopify_stock_graphql.py", line 2, in <module>
import lib.configuration as conf
File "/Workspace/Shared/sparky/lib/configuration.py", line 1, in <module>
from databricks.sdk.runtime import *
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/runtime/__init__.py", line 172, in <module>
dbutils = RemoteDbUtils()
^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/dbutils.py", line 194, in __init__
self._config = Config() if not config else config
^^^^^^^^
File "/databricks/python/lib/python3.12/site-packages/databricks/sdk/config.py", line 127, in __init__
raise ValueError(message) from e
ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.
..........
3 weeks ago
Currently, dbutils cannot be used inside UDFs. For secrets, instead of getting the secret inside the UDF, you can define it as a free variable outside the UDF; it will then be captured with the function and shipped to the workers, like this:
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import *
import pandas as pd

secret = dbutils.secrets.get("scope", "secret")

@pandas_udf(LongType())
def example_udf(value: pd.Series) -> pd.Series:
    print(secret)
    return value

spark.range(1).select(example_udf(col("id"))).display()
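For the applyInPandas case from the question, the same idea applies, with the extra caveat that the imported library itself must not run `from databricks.sdk.runtime import *` (or any dbutils call) at module import time, since that import is re-executed on the workers. A sketch, where fetch_stock(pdf, url=..., token=...) is a hypothetical entry point that accepts the values as arguments:

import pandas as pd

# Resolved once on the driver, where dbutils is available.
api_token = dbutils.secrets.get("scope", "api-token")
shop_url = dbutils.widgets.get("shop_url")

def call_api(pdf: pd.DataFrame) -> pd.DataFrame:
    # The captured values are plain strings, so nothing executed on the
    # workers needs dbutils or default SDK credentials.
    import lib.graphql.shopify_stock_graphql as api              # must not import databricks.sdk.runtime
    return api.fetch_stock(pdf, url=shop_url, token=api_token)   # hypothetical signature

df = spark.createDataFrame([(1, "shop-a"), (2, "shop-b")], "id long, shop string")
df.groupBy("shop").applyInPandas(call_api, schema="id long, shop string").display()

In other words, the library receives widget and secret values as ordinary function arguments instead of reading them itself.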
3 weeks ago - last edited 3 weeks ago
What about creating a function like this?
CREATE OR REPLACE FUNCTION geocode_address(address STRING)
RETURNS STRUCT<latitude: DOUBLE, longitude: DOUBLE>
LANGUAGE PYTHON
ENVIRONMENT (
  dependencies = '["requests"]',
  environment_version = "None"
)
AS $$
import requests

api_key = dbutils.secrets.get("my-secret-scope", "google-maps-geocoding-api-key")
url = f"https://maps.googleapis.com/maps/api/geocode/json?address={address}&key={api_key}"
response = requests.get(url)
if response.status_code != 200:
    return None
try:
    data = response.json()
    if data['status'] == 'OK':
        location = data['results'][0]['geometry']['location']
        return (location['lat'], location['lng'])
    else:
        return None
except (KeyError, ValueError):
    return None
$$
...and then testing it like this:
SELECT geocode_address('1600 Amphitheatre Parkway, Mountain View, CA');
Currently it results in the following error:
NameError: name 'dbutils' is not defined
What's the recommended way of retrieving the secret in this case?
3 weeks ago
Answering my own question. Similar to the original response, the answer was to pass in the secret as a function argument:
CREATE OR REPLACE FUNCTION geocode_address(address STRING, api_key STRING)
RETURNS STRUCT<latitude: DOUBLE, longitude: DOUBLE>
LANGUAGE PYTHON
AS $$
import requests

url = f"https://maps.googleapis.com/maps/api/geocode/json?address={address}&key={api_key}"
response = requests.get(url)
if response.status_code != 200:
    return None
try:
    data = response.json()
    if data['status'] == 'OK':
        location = data['results'][0]['geometry']['location']
        return (location['lat'], location['lng'])
    else:
        return None
except (KeyError, ValueError):
    return None
$$
And then here is how to call it:
SELECT geocode_address('1600 Amphitheatre Parkway, Mountain View, CA', secret("my-secret-scope", "google-maps-geocoding-api-key"));
Note: this won't work on a Serverless Warehouse (or Serverless compute) as by default they restrict outbound traffic.
3 weeks ago
I've run outbound GraphQL calls on serverless, but on the Azure version of Databricks; Azure doesn't restrict that traffic by default.
My problem with serverless is the one described in "Python versions in the Spark Connect clien..." (Databricks Community post 121213), so serverless is still unusable for my UDFs.