Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Register permanent UDF from Python file

IvanK
New Contributor III

Hello,

I am trying to create a permanent UDF from a Python file with dependencies that are not part of the standard Python library.

How do I make use of CREATE FUNCTION (External) [1] to create a permanent function in Databricks, using a Python file that contains my function?

NOTE: If I understand it correctly, CREATE FUNCTION (SQL and Python) [2] will not work in our case, because dependencies are limited to the libraries listed in [2]. We also want to automate this: I will write and update our functions in a Git repo and use CI/CD to create or update the permanent functions in Databricks.

 

References

[1] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-function.html

[2] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html#suppo...

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @IvanK, pyspark.sql.functions.udf is a PySpark function that lets you create user-defined functions (UDFs).

These UDFs can be used to perform operations that are not built into Spark.

Here is a general way to create a UDF in Databricks:

1. Define your Python function in a Python file.
2. Read this Python file in Databricks.
3. Use pyspark.sql.functions.udf to convert your Python function into a UDF.
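Step 2 is not spelled out here. One possible reading, sketched below with the standard library only, is to import the function from a file path with importlib. The file name my_udfs.py and the helper load_function are hypothetical, and the file is assumed to be readable from the driver.

```python
import importlib.util
from pathlib import Path


def load_function(path, func_name):
    """Import the module at `path` and return its attribute `func_name`."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, func_name)


# Hypothetical usage on the driver (requires pyspark, not run here):
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import IntegerType
#   my_udf = udf(load_function("my_udfs.py", "my_function"), IntegerType())
```

Any third-party libraries the loaded function imports would still need to be installed on the cluster.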

Here is an example of how you can create a UDF:

python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

# Define your Python function (returns None for NULL input instead of raising)
def my_function(s):
    return len(s) if s is not None else None

# Convert your Python function into a UDF
my_udf = udf(my_function, IntegerType())

You can then use my_udf in your Spark dataframes.
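As a minimal sketch of what that could look like, assuming an existing SparkSession is passed in as spark: the helper apply_udf, the column names s and s_len, and the SQL name my_len are all hypothetical. The pyspark imports sit inside the helper so that the plain Python function can be unit tested without Spark.

```python
def my_function(s):
    # Plain Python logic; null-safe so a NULL row does not raise
    return len(s) if s is not None else None


def apply_udf(spark):
    """Apply my_function to a small DataFrame and register it for SQL."""
    # Imported here so my_function stays testable without pyspark installed
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    my_udf = udf(my_function, IntegerType())
    df = spark.createDataFrame([("spark",), ("udf",)], ["s"])
    with_len = df.withColumn("s_len", my_udf(col("s")))

    # Makes the function callable from SQL, but only in this SparkSession
    spark.udf.register("my_len", my_function, IntegerType())
    return with_len
```

Note that both the DataFrame UDF and the name registered with spark.udf.register live only as long as the session that created them.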

As for automating this process with CI/CD, you can consider the following steps:

1. Store your Python files containing the UDFs in a Git repository.
2. Set up a CI/CD pipeline that triggers whenever there are changes in this Git repository.
3. In the pipeline, write a script that reads the Python files, creates the UDFs, and updates them in Databricks.
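How step 3 finds the functions is left open. As one hedged sketch, a pipeline script could scan the repository for Python files and list their public top-level functions before registering them. The helper name discover_functions and the convention that every public top-level function is a UDF candidate are assumptions; the registration step itself is not shown because it depends on how the pipeline connects to Databricks.

```python
import ast
from pathlib import Path


def discover_functions(repo_dir):
    """Map each .py file under repo_dir to its public top-level function names."""
    found = {}
    for path in sorted(Path(repo_dir).rglob("*.py")):
        # Parse without importing, so third-party imports need not be installed
        tree = ast.parse(path.read_text(encoding="utf-8"))
        names = [
            node.name
            for node in tree.body
            if isinstance(node, ast.FunctionDef) and not node.name.startswith("_")
        ]
        if names:
            found[str(path.relative_to(repo_dir))] = names
    return found
```

Parsing with ast rather than importing keeps the discovery step independent of the libraries the UDFs themselves need.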

Please note that the exact way to set up this pipeline depends on the CI/CD tool you're using and how you're using Databricks.

IvanK
New Contributor III

Hello @Kaniz_Fatma ,
Thank you for the reply, but I do not think this will solve our problem. Let me give you some more background:

We are going to use the orchestration tool Prefect to start clusters in Databricks and then run dbt on the started clusters to perform e.g. transformation of data.
The transformations we will use are written in Python scripts (it is way too complex to write these transformations in SQL). The Python scripts will make use of other libraries, and we write/update our Python scripts in our Git repo (outside of Databricks).

As I understand it, the Databricks clusters that we start must have an active SparkSession before UDFs can be registered. An active SparkSession is only created when a notebook is attached to the cluster (correct me if I am wrong). We do not want to use notebooks, as they add more complexity.


Also, a UDF is only registered within a single SparkSession. This means that you have to register the UDF again every time you start or create a cluster, which is not convenient. (NOTE: Python UDFs defined in Unity Catalog are limited to the libraries in the Databricks Runtime. Because we need other libraries, defining Python UDFs in Unity Catalog is not an option for us.)

So, to get around this problem, we would like to create a permanent function that can be used by any cluster, without the user or service principal having to register the function every time they start a cluster.


This way, we can add/update/remove functions without users having to remember that they have to register the functions before using them.

Is that possible to do?


Also, regarding your reply:
You wrote: "2. Read this Python file in Databricks". What does this mean? Where, and how do you do this?

Best regards,
IvanK
