Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I execute a Spark SQL query inside a Unity Catalog Python UDF so I can run downstream ML?

ravimaranganti
New Contributor

I want to build an LLM-driven chatbot using an Agentic AI framework within Databricks. The idea is for the LLM to generate a SQL text string, which is then passed to a Unity Catalog-registered Python UDF tool. Within this tool, the SQL needs to be executed (based on the text string received from the LLM) so I can immediately run a machine learning model on the returned data. I am avoiding passing data directly to the Python UDF tool so as not to blow token limits, which is the main reason I am trying to pass only a SQL text string to the UDF and have the SQL run inside it.

However, any attempt to call spark.sql() or instantiate a SparkSession in my SQL-defined Python UDF fails under the SafeSpark sandbox (there is no global spark available, and SparkContext creation is blocked).
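
For concreteness, here is roughly the shape of what I am attempting (catalog, schema, and function names are placeholders):

```python
# Roughly what I'm attempting (placeholder names). The Python body runs in
# the SafeSpark sandbox, where no global `spark` exists and SparkContext
# creation is blocked, so the body fails as soon as it touches Spark.
spark.sql("""
CREATE OR REPLACE FUNCTION main.default.run_sql_and_score(sql_text STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
  from pyspark.sql import SparkSession
  spark = SparkSession.builder.getOrCreate()   # blocked in the sandbox
  return str(spark.sql(sql_text).collect())    # never reached
$$
""")
```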

  • Is there a supported way for a SQL-defined Python UDF to invoke Spark SQL directly inside Unity Catalog?

  • If not, what production-quality patterns let me register a “query-driven” Python function in UC—one that takes only a SQL string and under the hood fetches the DataFrame and applies ML logic?

Similar questions have been asked before, but without satisfactory resolution.

Any pointers or examples would be greatly appreciated!

Accepted Solution

mark_ott
Databricks Employee

There is currently no supported method for SQL-defined Python UDFs in Unity Catalog to invoke Spark SQL or access a SparkSession from within the SafeSpark sandbox. This limitation is by design: the SafeSpark/Restricted Python Execution Environment enforced by Unity Catalog prohibits access to global Spark objects and the SparkContext, and blocks creation of new SparkSessions within the Python UDF scope. As a result, any pattern that expects the UDF to “consume” an arbitrary SQL query, run it, and fetch the results within the UDF itself is fundamentally unsupported in this environment.

Why This Limitation Exists

  • Security: Unity Catalog enforces strict data boundary and execution context controls to protect data access and lineage.

  • Isolation: UDF execution is sandboxed to prevent privilege escalation and lateral movement.

  • Resource Management: Spark driver/session lifecycles are managed outside the scope of function execution, and cannot be manipulated inside the UDF.

Supported Patterns & Best Practices

If the architectural goal is to let the LLM/agent emit dynamically generated SQL, there are recommended production-quality alternatives:

1. Orchestrate SQL execution outside the UDF

  • Run the LLM-generated SQL outside the Python UDF, within the main Spark driver/notebook/job code.

  • Get the desired DataFrame as the result.

  • Hand the DataFrame to ordinary Python ML logic (for example a plain function call, or a pandas UDF applied over the DataFrame) rather than routing the data through the UC UDF.

  • This avoids token blow-up, since only the query's result set (not the full underlying data) is handed to the function; see the example at the end of this answer.

2. Python Function Registration with Arguments

  • Register a function that receives the DataFrame as an argument, not the SQL string.

  • Or, use a function that takes parameters for dynamic filtering/aggregation (but not an arbitrary SQL string) and run the query externally, as in the sketch below.
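
A minimal sketch of this pattern, with hypothetical table and column names; the args parameter of spark.sql() requires Spark 3.4+ (DBR 13+):

```python
# Pattern 2 sketch (hypothetical table/column names): expose a narrow,
# parameterized entry point instead of accepting arbitrary SQL. The query
# runs on the driver, never inside a UC UDF.
def score_segment(region: str, min_spend: float):
    df = spark.sql(
        "SELECT * FROM main.sales.customers "
        "WHERE region = :region AND total_spend >= :min_spend",
        args={"region": region, "min_spend": min_spend},  # named parameter markers
    )
    return my_ml_model(df)  # same hypothetical ML callable as in the example below
```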

3. Use Databricks SQL functions or Procedures

  • LLM generates SQL/procedure invocations.

  • Move more of the logic into SQL procedures or table-valued functions (TVFs), which can run SQL internally within a governed, restricted shape; they cannot, however, run arbitrary Python ML. A sketch follows below.
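
As a minimal sketch (the catalog objects are hypothetical), a governed TVF gives the LLM something safe to invoke by name:

```python
# Pattern 3 sketch (hypothetical catalog objects): a SQL table-valued
# function whose query shape is fixed; the LLM can only vary its parameters.
spark.sql("""
CREATE OR REPLACE FUNCTION main.analytics.recent_orders(days INT)
RETURNS TABLE (order_id BIGINT, amount DOUBLE)
RETURN SELECT order_id, amount
       FROM   main.sales.orders
       WHERE  order_ts >= date_sub(current_date(), days)
""")

# The LLM now only has to emit: SELECT * FROM main.analytics.recent_orders(7)
```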

4. Agent Pattern Outside UDF

  • LLM outputs both the SQL string and a symbolic reference to which ML logic to apply.

  • The Spark driver/job runs the SQL statement, stores the result in a temporary or managed view, then invokes the downstream Python ML function with the evaluated DataFrame (see the sketch below).
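
A minimal sketch of this dispatch loop; llm, score_churn, and run_forecast are assumed stand-ins, not real APIs:

```python
# Pattern 4 sketch: the LLM returns the SQL plus a symbolic routine name;
# the driver does both the execution and the dispatch.
ML_ROUTINES = {"churn": score_churn, "forecast": run_forecast}  # hypothetical registry

plan = llm.generate_plan(user_question)      # e.g. {"sql": "...", "routine": "churn"}
df = spark.sql(plan["sql"])                  # runs on the driver, not in a UDF
df.createOrReplaceTempView("agent_result")   # optional staging for later steps
result = ML_ROUTINES[plan["routine"]](df)    # symbolic name -> Python callable
```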

5. MLflow Pipelines/Mappings

  • For cases where the operation is always “run this model on this query,” maintain a registry (MLflow, Delta Live Tables, custom mapping) from model-id ↔ ML function, and orchestrate with job workflows; see the sketch below.
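
A minimal sketch, assuming a UC-registered MLflow model (the model name and the upstream query are placeholders):

```python
import mlflow

# Pattern 5 sketch: resolve a model identifier against the MLflow registry
# (UC-backed), then score the result of the externally executed query.
mlflow.set_registry_uri("databricks-uc")
model = mlflow.pyfunc.load_model("models:/main.ml.churn_model/1")  # hypothetical name
df = spark.sql(query)                # query produced upstream, outside any UDF
predictions = model.predict(df.toPandas())
```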

Why “SQL String as UDF Input” is Not Supported

  • The UDF runs in a completely different (sandboxed) context without Spark connectivity.

  • UDFs can transform row-level input; they cannot drive or reconfigure the Spark engine itself.

  • No internal SparkSession is available; creating new ones is explicitly blocked.

Example: Supported Orchestration

```python
# Orchestration pattern outside the UDF
query = llm.generate_sql(query_context)   # LLM produces the SQL text
df = spark.sql(query)                     # executed on the driver

# Apply ML logic
result = my_ml_model(df)
```

If the LLM also needs to select the ML logic, it can output both the query and an ML routine identifier, as in the pattern 4 sketch above.

Further Reading / References

  • Databricks documentation on SafeSpark sandbox and Unity Catalog Python UDF limitations.

  • Databricks forum discussions on why spark.sql() is unavailable in Python UDFs.

  • Patterns for using ML and AI with Unity Catalog, best practice guides.


In summary: It is not possible to directly invoke Spark SQL from a SQL-defined Python UDF inside Unity Catalog. All production-quality solutions require running the SQL query and fetching results outside the UDF, then feeding the result DataFrame to downstream logic (ML or otherwise).

