Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I execute a Spark SQL query inside a Unity Catalog Python UDF so I can run downstream ML?

ravimaranganti
New Contributor

I want to build an LLM-driven chatbot using an Agentic AI framework within Databricks. The idea is for the LLM to generate a SQL text string, which is then passed to a Unity Catalog-registered Python UDF tool. Within this tool, the SQL needs to be executed (based on the text string received from the LLM) so I can immediately run a machine learning model on the returned data. I am avoiding passing data directly to the Python UDF tool so as not to blow token limits, which is the main reason I am trying to pass only a SQL text string to the UDF and have the SQL run inside it.

However, any attempt to call spark.sql() or instantiate a SparkSession in my SQL-defined Python UDF fails under the SafeSpark sandbox (there is no global spark available, and SparkContext creation is blocked).
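
For concreteness, here is roughly the shape of what I am attempting (catalog, schema, and function names are placeholders):

```python
# Roughly what I'm attempting (placeholder names). The Python body runs in
# the SafeSpark sandbox, where no global `spark` exists and SparkContext
# creation is blocked, so the body fails as soon as it touches Spark.
spark.sql("""
CREATE OR REPLACE FUNCTION main.default.run_sql_and_score(sql_text STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  from pyspark.sql import SparkSession
  spark = SparkSession.builder.getOrCreate()   # blocked in the sandbox
  return str(spark.sql(sql_text).collect())    # never reached
$$
""")
```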

  • Is there a supported way for a SQL-defined Python UDF to invoke Spark SQL directly inside Unity Catalog?

  • If not, what production-quality patterns let me register a “query-driven” Python function in UC—one that takes only a SQL string and under the hood fetches the DataFrame and applies ML logic?

Similar questions have been asked before, but without satisfactory resolution.

Any pointers or examples would be greatly appreciated!

Accepted Solution

mark_ott
Databricks Employee

There is currently no supported method for SQL-defined Python UDFs in Unity Catalog to invoke Spark SQL or access a SparkSession from within the SafeSpark sandbox. This limitation is by design: the SafeSpark/Restricted Python Execution Environment enforced by Unity Catalog prohibits access to global Spark objects and the SparkContext, and blocks creation of new SparkSessions within the Python UDF scope. As a result, any pattern that expects the UDF to “consume” an arbitrary SQL query, run it, and fetch the results within the UDF itself is fundamentally unsupported in this environment.

Why This Limitation Exists

  • Security: Unity Catalog enforces strict data boundary and execution context controls to protect data access and lineage.

  • Isolation: UDF execution is sandboxed to prevent privilege escalation and lateral movement.

  • Resource Management: Spark driver/session lifecycles are managed outside the scope of function execution, and cannot be manipulated inside the UDF.

Supported Patterns & Best Practices

If the architectural goal is to let the LLM/agent emit dynamically generated SQL, there are recommended production-quality alternatives:

1. Orchestrate SQL execution outside the UDF

  • Run the LLM-generated SQL outside the Python UDF, within the main Spark driver/notebook/job code.

  • Get the desired DataFrame as the result.

  • Hand the DataFrame to ordinary Python ML logic (for example a plain function call, or a pandas UDF applied over the DataFrame) rather than routing the data through the UC UDF.

  • This avoids token blow-up, since only the query's result set (not the full underlying data) is handed to the function; see the example at the end of this answer.

2. Python Function Registration with Arguments

  • Register a function that receives the DataFrame as an argument, not the SQL string.

  • Or, use a function that takes parameters for dynamic filtering/aggregation (but not an arbitrary SQL string) and run the query externally, as in the sketch below.
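
A minimal sketch of this pattern, with hypothetical table and column names; the args parameter of spark.sql() requires Spark 3.4+ (DBR 13+):

```python
# Pattern 2 sketch (hypothetical table/column names): expose a narrow,
# parameterized entry point instead of accepting arbitrary SQL. The query
# runs on the driver, never inside a UC UDF.
def score_segment(region: str, min_spend: float):
    df = spark.sql(
        "SELECT * FROM main.sales.customers "
        "WHERE region = :region AND total_spend >= :min_spend",
        args={"region": region, "min_spend": min_spend},  # named parameter markers
    )
    return my_ml_model(df)  # same hypothetical ML callable as in the example below
```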

3. Use Databricks SQL functions or Procedures

  • LLM generates SQL/procedure invocations.

  • Move more of the logic into SQL procedures or table-valued functions (TVFs), which can run SQL internally within a governed, restricted shape; they cannot, however, run arbitrary Python ML. A sketch follows below.
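
As a minimal sketch (the catalog objects are hypothetical), a governed TVF gives the LLM something safe to invoke by name:

```python
# Pattern 3 sketch (hypothetical catalog objects): a SQL table-valued
# function whose query shape is fixed; the LLM can only vary its parameters.
spark.sql("""
CREATE OR REPLACE FUNCTION main.analytics.recent_orders(days INT)
RETURNS TABLE (order_id BIGINT, amount DOUBLE)
RETURN SELECT order_id, amount
       FROM   main.sales.orders
       WHERE  order_ts >= date_sub(current_date(), days)
""")

# The LLM now only has to emit: SELECT * FROM main.analytics.recent_orders(7)
```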

4. Agent Pattern Outside UDF

  • LLM outputs both the SQL string and a symbolic reference to which ML logic to apply.

  • The Spark driver/job runs the SQL statement, stores the result in a temporary or managed view, then invokes the downstream Python ML function with the evaluated DataFrame (see the sketch below).
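
A minimal sketch of this dispatch loop; llm, score_churn, and run_forecast are assumed stand-ins, not real APIs:

```python
# Pattern 4 sketch: the LLM returns the SQL plus a symbolic routine name;
# the driver does both the execution and the dispatch.
ML_ROUTINES = {"churn": score_churn, "forecast": run_forecast}  # hypothetical registry

plan = llm.generate_plan(user_question)      # e.g. {"sql": "...", "routine": "churn"}
df = spark.sql(plan["sql"])                  # runs on the driver, not in a UDF
df.createOrReplaceTempView("agent_result")   # optional staging for later steps
result = ML_ROUTINES[plan["routine"]](df)    # symbolic name -> Python callable
```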

5. MLflow Pipelines/Mappings

  • For cases where the operation is always “run this model on this query,” maintain a registry (MLflow, Delta Live Tables, custom mapping) from model-id ↔ ML function, and orchestrate with job workflows; see the sketch below.
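
A minimal sketch, assuming a UC-registered MLflow model (the model name and the upstream query are placeholders):

```python
import mlflow

# Pattern 5 sketch: resolve a model identifier against the MLflow registry
# (UC-backed), then score the result of the externally executed query.
mlflow.set_registry_uri("databricks-uc")
model = mlflow.pyfunc.load_model("models:/main.ml.churn_model/1")  # hypothetical name
df = spark.sql(query)                # query produced upstream, outside any UDF
predictions = model.predict(df.toPandas())
```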

Why “SQL String as UDF Input” is Not Supported

  • The UDF runs in a completely different (sandboxed) context without Spark connectivity.

  • UDFs can transform row-level input; they cannot drive or reconfigure the Spark engine itself.

  • No internal SparkSession is available; creating new ones is explicitly blocked.

Example: Supported Orchestration

```python
# Orchestration pattern outside the UDF
query = llm.generate_sql(query_context)   # LLM produces the SQL text
df = spark.sql(query)                     # executed on the driver

# Apply ML logic
result = my_ml_model(df)
```

If the LLM also needs to select the ML logic, it can output both the query and an ML routine identifier, as in the pattern 4 sketch above.

Further Reading / References

  • Databricks documentation on SafeSpark sandbox and Unity Catalog Python UDF limitations.

  • Databricks forum discussions on why spark.sql() is unavailable in Python UDFs.

  • Patterns for using ML and AI with Unity Catalog, best practice guides.


In summary: It is not possible to directly invoke Spark SQL from a SQL-defined Python UDF inside Unity Catalog. All production-quality solutions require running the SQL query and fetching results outside the UDF, then feeding the result DataFrame to downstream logic (ML or otherwise).

