lingareddy_Alva
Esteemed Contributor

Hi @stefan-vulpe 

Looking at your code and the behavior you're describing, I can identify the core issue and provide some insights about Batch Python UDFs in Databricks.

The Core Problem
The issue you're encountering is related to session isolation and UDF registration scope in Databricks. Here's what's happening:
1. SQL Editor vs Python Notebook Sessions: When you create a UDF using CREATE OR REPLACE FUNCTION in the SQL Editor, it gets registered in Unity Catalog, but the Python function handler is tied to that specific Spark session.
2. Session Isolation: When you run spark.sql() from a Python notebook, you're using a different Spark session than the SQL Editor, even though both can reference the same Unity Catalog function metadata.
3. Missing Function Handler: The function metadata exists in Unity Catalog, but the actual Python function implementation isn't available in the Python notebook's session, causing it to return NULL values.

Solutions:
Option 1: Register the UDF in the Same Python Session
Option 2: Use DataFrame API Instead of SQL
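
Here is a rough sketch of both options. Your original UDF body isn't shown here, so normalize_text is just a hypothetical stand-in for your logic, and the names register_in_session, apply_with_dataframe_api, normalize_text_udf and the column names are my own assumptions, not anything from your code. It assumes pyspark and pandas are available on the cluster.

```python
from typing import Optional


def normalize_text(value: Optional[str]) -> Optional[str]:
    """Hypothetical per-row logic: collapse whitespace and lowercase.

    Anything that is not a string (for example a NULL) is passed through as None.
    """
    if not isinstance(value, str):
        return None
    return " ".join(value.split()).lower()


def register_in_session(spark, name: str = "normalize_text_udf"):
    """Option 1: register a pandas (batch) UDF in the session that will call it.

    The pyspark/pandas imports are kept inside the function so the plain
    Python logic above can be imported and tested without a cluster.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    @pandas_udf(StringType())
    def _batch(values: pd.Series) -> pd.Series:
        # Runs once per Arrow batch; map applies the row logic element-wise.
        return values.map(normalize_text)

    # Makes the UDF callable by name from spark.sql() in this session only.
    spark.udf.register(name, _batch)
    return _batch


def apply_with_dataframe_api(df, batch_udf, column: str, out_column: str = "normalized"):
    """Option 2: skip SQL and call the UDF object through the DataFrame API."""
    from pyspark.sql.functions import col

    return df.withColumn(out_column, batch_udf(col(column)))
```

In a notebook you would call something like batch_udf = register_in_session(spark) first, and then either spark.sql("SELECT normalize_text_udf(text_col) FROM my_table") or apply_with_dataframe_api(df, batch_udf, "text_col"), where my_table and text_col are placeholders for your own table and column. Keeping the row logic in a plain function also makes it easy to move into a shared library later.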

The NULL values you're seeing occur because the Python function handler isn't available in your Python notebook's Spark session, even though the function metadata exists in Unity Catalog. The most reliable approach is to register your pandas UDFs within the same session where you'll use them, either through direct registration or by importing them from a shared library. This limitation is inherent to how Python UDFs work in distributed environments: the actual Python code needs to be available to all executors in the session where it's being used.


LR
