How can I execute a Spark SQL query inside a Unity Catalog Python UDF so I can run downstream ML?

ravimaranganti — Sat, 14 Jun 2025 17:10:56 GMT

I want to build an LLM-driven chatbot using Agentic AI framework within Databricks. The idea is for the LLM to generate a SQL text string which then passed to a Unity Catalog-registered Python UDF tool. Within this tool, I need the SQL to be executed (based on the SQL text string it receives from the LLM) so I can immediately run a machine learning model on the returned data. I am avoiding passing data directly to the Python UDF tool to avoid blowing token limits. This is the main reason for me trying to just pass a SQL text string to the Python UDF and have the SQL run within that.

However, any attempt to call spark.sql() or instantiate a SparkSession in my SQL-defined Python UDF fails under the SafeSpark sandbox (there is no global spark available, and SparkContext creation is blocked).

Is there a supported way for a SQL-defined Python UDF to invoke Spark SQL directly inside Unity Catalog?
If not, what production-quality patterns let me register a “query-driven” Python function in UC—one that takes only a SQL string and under the hood fetches the DataFrame and applies ML logic?

Similar questions have been asked before but without satisfactory resolution

Any pointers or examples would be greatly appreciated!

Re: How can I execute a Spark SQL query inside a Unity Catalog Python UDF so I can run downstream ML

mark_ott — Tue, 07 Oct 2025 12:12:31 GMT

There is currently no supported method for SQL-defined Python UDFs in Unity Catalog to invoke Spark SQL or access a SparkSession directly from within the SafeSpark sandbox. This limitation is by design: the SafeSpark/Restricted Python Execution Environment enforced by Unity Catalog prohibits access to global Spark objects, SparkContext, or creation of new SparkSessions within the Python UDF scope. As a result, any pattern that expects the UDF to “consume” an arbitrary SQL query, run it, and fetch results within the UDF itself, is fundamentally unsupported in this environment .

Why This Limitation Exists

Security: Unity Catalog enforces strict data boundary and execution context controls to protect data access and lineage.
Isolation: UDF execution is sandboxed to prevent privilege escalation and lateral movement.
Resource Management: Spark driver/session lifecycles are managed outside the scope of function execution, and cannot be manipulated inside the UDF.

Supported Patterns & Best Practices

If the architectural goal is to let LLM/Agentic AI output dynamically-generated SQL, there are recommended production-quality alternatives:

1. Orchestrate SQL execution outside the UDF

Run the LLM-generated SQL outside the Python UDF, within the main Spark driver/notebook/job code.
Get the desired DataFrame as the result.
Pass the DataFrame to a registered Python function (not as a UDF, but e.g., as a Pandas UDF or apply logic directly).
This avoids token blow-up since only the result set—not the full data—is passed to the function.

2. Python Function Registration with Arguments

Register a function that receives the DataFrame as an argument, not the SQL string.
Or, use a function that takes parameters for dynamic filtering/aggregation (but not an arbitrary SQL string) and run the query externally.

3. Use Databricks SQL functions or Procedures

LLM generates SQL/procedure invocations.
Register more logic as SQL Procedures or Table-valued Functions (TVFs) that can internally run arbitrary SQL but with restricted logic, not arbitrary Python ML.

4. Agent Pattern Outside UDF

LLM outputs both the SQL string and a symbolic reference to which ML logic to apply.
The Spark driver/job runs the SQL statement, stores result in a temporary or managed view, then invokes the downstream Python ML function with the evaluated DataFrame.

5. MLflow Pipelines/Mappings

For cases where the operation is always “run this model on this query,” maintain a registry (MLflow, Delta Live Tables, custom mapping) from model-id ↔ ML function, and orchestrate with job workflows.

Why “SQL String as UDF Input” is Not Supported

The UDF runs in a completely different (sandboxed) context without Spark connectivity.
UDFs can manipulate Row input, not metafunction the Spark engine.
No internal SparkSession is available; creating new ones is explicitly blocked.

Example: Supported Orchestration

python

# Orchestration pattern outside UDF
query = llm.generate_sql(query_context)
df = spark.sql(query)
# Apply ML logic
result = my_ml_model(df)

If LLM needs to select the ML logic, it can output both the query and the ML routine identifier.

topic How can I execute a Spark SQL query inside a Unity Catalog Python UDF so I can run downstream ML? in Data Engineering