Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Create a SQL (Python) UDF in a Serverless SQL Warehouse using an external library

cpelazza
New Contributor

Hi There,

Scenario at work is as described in the subject line:

Is it possible to author a SQL (Python) scalar UDF IN A SQL SERVERLESS WAREHOUSE that uses a library NOT included in any Databricks Runtime? And if so, how would one go about it? (The documentation seems to indicate this is not possible, since the warehouse's compute is self-provisioned, though that isn't 100% clear to me. To be clear, this must be done directly in SQL Serverless, without resorting to any other cluster type; otherwise the problem would already have been resolved.)

For context, the use case involves XML validation against nested, very complex, and large XSD schemas, which goes beyond what DBR currently offers "out of the box" in terms of built-in functions (we need all validation errors returned, flexible schema sources, etc.). But it could be any library! (For comparison, in Snowflake one can "declare" the necessary libraries within the SQL function body, and the full Anaconda distribution is available.)

Any help, suggestions, or comments would be greatly appreciated!

Thank you in advance for your time!


4 REPLIES

Kaniz_Fatma
Community Manager

Hi @cpelazza

  • Starting from Databricks Runtime 13.3 LTS and above, you can use the SQL CREATE FUNCTION statement to register Python UDFs.
  • These Python UDFs allow you to write Python code and invoke it through a SQL function in a secure and fully governed way, bringing the power of Python to Databricks SQL.
  • Only Pro and Serverless SQL warehouses support Python UDFs in Unity Catalog.
  • Since you specifically mentioned SQL Serverless, you’re on the right track!
  • While Databricks Runtime doesn’t include every library out of the box, you can still declare and use additional libraries within your Python UDF code.
  • For your XML validation use case, you can import the necessary libraries directly in your Python UDF code.
  • Suppose you want to validate XML against complex XSD schemas.
  • You can write a Python UDF that leverages libraries like lxml or any other relevant ones.
  • While Databricks doesn’t allow direct library declarations within the function body like Snowflake does, you can still import external libraries.
  • If you have further questions or need additional assistance, don’t hesitate to ask.
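To make the bullets above concrete, here is a minimal sketch of what the Python body of such a UDF could look like. It relies only on the stdlib xml.etree module (so no extra library is needed), which means it can check well-formedness only, not XSD validity. The CREATE FUNCTION wrapper shown in the comment is illustrative and should be checked against the documentation for your workspace.

```python
# Sketch of a Python UDF body using only the standard library.
# In Databricks SQL the registration would look roughly like
# (illustrative wrapper, not verified against a live workspace):
#
#   CREATE FUNCTION check_xml(doc STRING)
#   RETURNS STRING
#   LANGUAGE PYTHON
#   AS $$
#     ...function below, then: return check_xml(doc)
#   $$;

import xml.etree.ElementTree as ET

def check_xml(doc: str) -> str:
    """Return 'OK' if doc is well-formed XML, else the parse error.

    Note: this is well-formedness checking only; the stdlib cannot
    validate against an XSD schema.
    """
    try:
        ET.fromstring(doc)
        return "OK"
    except ET.ParseError as exc:
        return f"ParseError: {exc}"
```

For true XSD validation a third-party library (e.g. lxml or xmlschema) would be required in the UDF's environment, which is exactly the sticking point discussed below.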

Thank you for your response, @Kaniz_Fatma!!

While the answer makes sense, I haven't been able to figure out how one would actually do that, in particular these two points from your answer:

  • While Databricks Runtime doesn’t include every library out of the box, you can still declare and use additional libraries within your Python UDF code.
  • For your XML validation use case, you can import the necessary libraries directly in your Python UDF code.

I mean, I can import a library within the UDF's body, BUT how do I make that library available to the warehouse's compute if it is NOT part of a Runtime? (Also, I only see the defusedxml library in 13.3 LTS, whose deprecated lxml module now errors out; lxml is in 14.3 LTS but not fit for this purpose; and I see NO Python XML library in later runtimes, e.g. the Runtime 15 series.)

Would you be so kind as to provide some examples of how to bring a specific library in?

Most examples I came across just import built-in libraries, but if I want a library that is not in a Runtime, the only way I know of is declaring it as part of a cluster's libraries, which is not the same as using Serverless.

Please correct me if I am wrong on any of these, as I am not fully familiar with Serverless warehouses.

An example would be much appreciated!!

Thank you in advance!! 

Hi @cpelazza

  • The compute in SQL Serverless is self-provisioned, meaning you don’t need to manage clusters explicitly.
  • Unfortunately, importing third-party libraries not included in Databricks Runtime is not directly supported within Python UDFs.
  • To use an external library, consider the following approaches:
    • Create an Azure Function (Python-based) that performs the XML validation using the required libraries. Then call this function from your Python UDF.
    • If possible, preprocess the data outside the UDF (e.g., in a notebook or separate script) and store the results in a table. Then use the UDF for subsequent queries.
    • Implement custom logic within the UDF using built-in Python functions to handle XML validation.
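As a sketch of the first approach above (delegating validation to an Azure Function), the UDF body could POST the document to a validation endpoint and return the service's error report. The endpoint URL and the JSON contract below are entirely hypothetical placeholders; substitute your own Azure Function. Also note that outbound network access from serverless Python UDFs may be restricted, so verify connectivity in your workspace before committing to this design.

```python
# Illustrative "call an external service" sketch: the heavy XSD
# validation lives in an Azure Function, and the UDF only makes an
# HTTP call. URL and payload shape are hypothetical.

import json
import urllib.request

VALIDATOR_URL = "https://example-validator.azurewebsites.net/api/validate"  # hypothetical

def build_payload(xml_doc: str, schema_name: str) -> bytes:
    """Encode the request body the hypothetical service expects."""
    return json.dumps({"xml": xml_doc, "schema": schema_name}).encode("utf-8")

def validate_remote(xml_doc: str, schema_name: str) -> str:
    """POST the document to the validator and return its response body."""
    req = urllib.request.Request(
        VALIDATOR_URL,
        data=build_payload(xml_doc, schema_name),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Because each UDF invocation would issue one HTTP request, batching (the second approach above: preprocessing outside the UDF and storing results in a table) is likely cheaper for large volumes.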

Thank you so much for the clarification @Kaniz_Fatma !!

I believe we are either going with that option, or possibly exploring a Python DBT model.

Best regards,

cpelazza
