Python UDF in Unity Catalog - spark.sql error

MartinIsti
New Contributor III

I'm trying to utilise the option to create UDFs in Unity Catalog. That would be a great way to make functions available in a fairly straightforward manner without, for example, putting the function definitions in an extra notebook that I %run to make them available.

So, following https://learn.microsoft.com/en-us/azure/databricks/udf/unity-catalog, I create the following function:

CREATE OR REPLACE FUNCTION catalog.schema.WatermarkRead_UC(ADLSLocation STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$

    WatermarkValue = spark.sql(f"SELECT WatermarkValue FROM PARQUET.`{ADLSLocation}/_watermark_log`").collect()[0][0]

    return WatermarkValue

$$

And then call it:

SELECT catalog.schema.WatermarkRead_UC('abfss://container@storage.dfs.core.windows.net/path')

It returns the following error message:

NameError: name 'spark' is not defined

I tried all sorts of things, but I couldn't make it work. Shouldn't spark be available out of the box? The same function works as expected when I simply define it in a separate notebook, %run that notebook, and then call the function: it runs and returns a value.

I wonder whether this is a current limitation, a bug, or an error in my code/design. Any help would be appreciated. Thanks!

P.S.: I know I can register a UDF outside Unity Catalog, and that I can create a Python wheel to import from in the notebooks, but I'm after a UC-based solution if that is possible. Thanks.
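For reference, the session-scoped alternative mentioned above could be sketched roughly as follows (the function and helper names are illustrative, not from this thread, and a live SparkSession such as the ambient `spark` in a Databricks notebook is assumed):

```python
# Sketch of a session-scoped UDF registration (the workaround mentioned
# above, outside Unity Catalog). Names are illustrative.
def register_watermark_udf(spark):
    """Register a watermark-reading UDF on the given SparkSession."""

    def watermark_read(adls_location: str) -> str:
        # Same SELECT the UC function body attempts; here it works because
        # the enclosing SparkSession is captured by the closure.
        query = (
            "SELECT WatermarkValue FROM PARQUET."
            f"`{adls_location}/_watermark_log`"
        )
        return spark.sql(query).collect()[0][0]

    # Session-scoped: visible to SQL in this session only, not stored in UC.
    spark.udf.register("WatermarkRead", watermark_read)
    return watermark_read
```

In a notebook you would call `register_watermark_udf(spark)` once, after which `SELECT WatermarkRead('abfss://...')` works for that session.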

1 REPLY

MartinIsti
New Contributor III

I can see someone has asked a very similar question with the same error message:

https://community.databricks.com/t5/data-engineering/unable-to-use-sql-udf/td-p/61957

The OP hasn't yet provided sufficient details about their function, so no proper response has appeared so far. I have gone through the four listed points to narrow down the root cause of the error, and I have.

Below is an even more simplified function definition (to rule out any dependency on whether the cluster has access to the storage location) that fails with the same NameError: name 'spark' is not defined error:

CREATE OR REPLACE FUNCTION dev_fusion.log.WatermarkRead_UC(ADLSLocation STRING, WatermarkAttribute STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$

    WatermarkValue = spark.sql("SELECT 'value'").collect()[0][0]

    return WatermarkValue

$$

And one that works:

CREATE OR REPLACE FUNCTION dev_fusion.log.WatermarkRead_UC(ADLSLocation STRING, WatermarkAttribute STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$

    WatermarkValue = 'Value'

    return WatermarkValue

$$

The main difference is the spark.sql call, which suggests that the spark session object simply isn't available inside the isolated environment where Unity Catalog Python UDFs execute.
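Since the UDF body itself runs plain Python fine, one workaround (a sketch; the function name and the trimming logic are illustrative, not from this thread) is to fetch the watermark with spark.sql in the notebook, where `spark` exists, and pass the raw value into a spark-free UC function:

```python
# Hypothetical spark-free UC UDF body: the caller reads the watermark via
# spark.sql in the notebook and passes the raw value in, so the function
# itself never needs the SparkSession.
def watermark_format(watermark_value: str) -> str:
    # Any pure-Python processing works in the UC sandbox.
    return watermark_value.strip()

print(watermark_format("  2024-01-01  "))  # -> 2024-01-01
```

The same body would go between the `AS $$ ... $$` markers of a `LANGUAGE PYTHON` function, just like the working example above.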
