There is currently no supported way for a SQL-defined Python UDF in Unity Catalog to invoke Spark SQL or access a SparkSession from within the SafeSpark sandbox. This limitation is by design: the SafeSpark/Restricted Python Execution Environment enforced by Unity Catalog prohibits access to global Spark objects or the SparkContext, and blocks creation of new SparkSessions inside the Python UDF scope. As a result, any pattern that expects the UDF to “consume” an arbitrary SQL query, run it, and fetch results within the UDF itself is fundamentally unsupported in this environment.
Why This Limitation Exists
- Security: Unity Catalog enforces strict data-boundary and execution-context controls to protect data access and lineage.
- Isolation: UDF execution is sandboxed to prevent privilege escalation and lateral movement.
- Resource Management: Spark driver/session lifecycles are managed outside the scope of function execution and cannot be manipulated inside the UDF.
Supported Patterns & Best Practices
If the architectural goal is to let an LLM or agentic AI emit dynamically generated SQL, there are recommended production-quality alternatives:
1. Orchestrate SQL execution outside the UDF
- Run the LLM-generated SQL outside the Python UDF, in the main Spark driver/notebook/job code.
- Capture the result as a DataFrame.
- Pass the DataFrame to your Python logic directly, or apply it column-wise with a Pandas UDF, rather than pushing the SQL into a UDF.
- This avoids token blow-up, since only the result set (not the full data) is passed to the function. See the sketch after this list.
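A minimal sketch of this pattern, assuming a hypothetical llm helper that returns a SQL string and a placeholder amount column in the result (spark is the ambient session in a Databricks notebook):

# Driver-side orchestration: run the LLM-generated SQL first,
# then apply Python logic to the resulting DataFrame.
from pyspark.sql.functions import pandas_udf
import pandas as pd

query = llm.generate_sql(query_context)  # hypothetical LLM helper
df = spark.sql(query)                    # executed on the driver, not inside a UDF

@pandas_udf("double")
def score(amount: pd.Series) -> pd.Series:
    # Placeholder scoring logic; swap in your real ML routine.
    return amount * 0.1

result = df.withColumn("score", score(df["amount"]))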
2. Python Function Registration with Arguments
- Register a function that receives the DataFrame as an argument, not the SQL string.
- Or, use a function that takes parameters for dynamic filtering/aggregation (but not an arbitrary SQL string) and run the query externally, as shown below.
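For example, a driver-side helper that accepts structured parameters instead of raw SQL might look like this (a sketch; the named-parameter args binding requires Spark 3.4+/recent DBR, and the table name is a placeholder):

# Accept structured parameters, never a raw SQL string.
def fetch_sales(region: str, min_amount: float):
    # Named-parameter binding keeps LLM output out of the query text.
    return spark.sql(
        "SELECT * FROM main.sales.orders "
        "WHERE region = :region AND amount >= :min_amount",
        args={"region": region, "min_amount": min_amount},
    )

df = fetch_sales("EMEA", 100.0)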
3. Use Databricks SQL Functions or Procedures
- The LLM generates SQL or procedure invocations.
- Register more of the logic as SQL procedures or table-valued functions (TVFs) that can run SQL internally, but with restricted, reviewed logic rather than arbitrary Python ML; an illustration follows this list.
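As an illustration, a SQL table-valued function can be registered once with fixed, reviewed SQL, so the LLM only supplies parameter values (a sketch; catalog, schema, table, and column names are placeholders):

# Register a TVF whose body is fixed SQL; only the parameter
# values come from the LLM, never the query text itself.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.analytics.recent_orders(days INT)
    RETURNS TABLE (order_id BIGINT, amount DOUBLE)
    RETURN SELECT order_id, amount
           FROM main.analytics.orders
           WHERE order_date >= date_sub(current_date(), days)
""")

# The LLM emits only the invocation, e.g.:
df = spark.sql("SELECT * FROM main.analytics.recent_orders(7)")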
4. Agent Pattern Outside UDF
- The LLM outputs both the SQL string and a symbolic reference to the ML logic to apply.
- The Spark driver/job runs the SQL statement, stores the result in a temporary or managed view, then invokes the downstream Python ML function on the evaluated DataFrame (see the sketch below).
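A sketch of that dispatch loop, assuming the LLM returns a plan with a sql field and an ml_fn identifier (llm, user_request, and the ML routines are all hypothetical names):

# Driver-side agent loop: the LLM chooses *which* registered ML
# routine runs; the driver executes both the SQL and the dispatch.
def score_churn(df):
    return df  # placeholder; real code would apply a churn model

def score_fraud(df):
    return df  # placeholder; real code would apply a fraud model

ML_ROUTINES = {"churn_score": score_churn, "fraud_score": score_fraud}

plan = llm.generate_plan(user_request)      # e.g. {"sql": "...", "ml_fn": "churn_score"}
df = spark.sql(plan["sql"])
df.createOrReplaceTempView("agent_result")  # optional: reuse in later steps
result = ML_ROUTINES[plan["ml_fn"]](df)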
5. MLflow Pipelines/Mappings
- For cases where the operation is always “run this model on this query,” maintain a registry (MLflow, Delta Live Tables, or a custom mapping) from model-id ↔ ML function, and orchestrate with job workflows, as sketched below.
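One way to express such a mapping is a small driver-side registry keyed by model id, loading models through mlflow.pyfunc (a sketch; the registered model URIs are placeholders):

import mlflow.pyfunc

# model-id -> registered MLflow model URI (placeholder names)
MODEL_REGISTRY = {
    "churn": "models:/churn_model/Production",
    "fraud": "models:/fraud_model/Production",
}

def run_model(model_id: str, df):
    model = mlflow.pyfunc.load_model(MODEL_REGISTRY[model_id])
    # pyfunc models accept a pandas DataFrame for batch scoring
    return model.predict(df.toPandas())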
Why “SQL String as UDF Input” is Not Supported
- The UDF runs in a completely separate (sandboxed) context without Spark connectivity.
- UDFs operate on row-level input; they cannot act as a meta-layer that drives the Spark engine.
- No internal SparkSession is available, and creating new ones is explicitly blocked. A sketch of the unsupported anti-pattern follows.
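To make the restriction concrete, here is the anti-pattern in question: a Unity Catalog Python UDF whose body tries to obtain a SparkSession (a sketch of what does not work; catalog and function names are placeholders):

# Anti-pattern: the UDF body has no Spark connectivity, so this
# function cannot run spark.sql() inside the sandbox.
spark.sql("""
    CREATE OR REPLACE FUNCTION main.demo.run_query(q STRING)
    RETURNS STRING
    LANGUAGE PYTHON
    AS $$
      from pyspark.sql import SparkSession
      # No active session exists in the sandbox, and creating
      # one is explicitly blocked, so this cannot succeed.
      spark = SparkSession.builder.getOrCreate()
      return str(spark.sql(q).collect())
    $$
""")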
Example: Supported Orchestration
# Orchestration pattern outside the UDF: the driver runs the SQL,
# then hands the resulting DataFrame to downstream Python logic.
query = llm.generate_sql(query_context)  # llm is a placeholder LLM client
df = spark.sql(query)                    # executed in the driver, not in a UDF
# Apply ML logic to the evaluated result
result = my_ml_model(df)                 # my_ml_model is a placeholder function
If the LLM needs to select the ML logic, it can output both the query and an ML routine identifier, as in pattern 4 above.
Further Reading / References
- Databricks documentation on the SafeSpark sandbox and Unity Catalog Python UDF limitations.
- Databricks forum discussions on why spark.sql() is unavailable in Python UDFs.
- Best-practice guides and patterns for using ML and AI with Unity Catalog.
In summary: It is not possible to directly invoke Spark SQL from a SQL-defined Python UDF inside Unity Catalog. All production-quality solutions require running the SQL query and fetching results outside the UDF, then feeding the result DataFrame to downstream logic (ML or otherwise).