Databricks Community

IvanK · 09-19-2023

Hello,

I am trying to create a permanent UDF from a Python file with dependencies that are not part of the standard Python library.

How do I make use of CREATE FUNCTION (External) [1] to create a permanent function in Databricks, using a Python file that contains my function?

NOTE: If I understand correctly, CREATE FUNCTION (SQL and Python) [2] will not work in our case, because its dependencies are limited to the libraries listed in [2]. We also want to automate this: I will write and update our functions in a Git repo and use CI/CD to create or update the permanent functions in Databricks.
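To make the CI/CD part of the question concrete, here is a minimal sketch of a helper that a pipeline could use to render a CREATE FUNCTION (External) statement. It assumes the statement shape described in [1]: a function name, an implementing class name, and optional JAR, FILE, or ARCHIVE resources. The helper name `render_create_function` and the example function, class, and path are hypothetical, and whether a plain Python file can serve as the implementation is exactly the open question here.

```python
def render_create_function(name, class_name, resources=(), replace=True):
    """Render a CREATE FUNCTION (External) statement as a string.

    name:       function name to create (hypothetical in the example below).
    class_name: implementing class, as the syntax in [1] expects.
    resources:  iterable of (kind, uri) pairs; kind is JAR, FILE, or ARCHIVE.
    """
    allowed = {"JAR", "FILE", "ARCHIVE"}
    clauses = []
    for kind, uri in resources:
        kind = kind.upper()
        if kind not in allowed:
            raise ValueError(f"unsupported resource kind: {kind}")
        if "'" in uri:
            # Keep the sketch simple: reject values that would need escaping.
            raise ValueError("resource URI must not contain a single quote")
        clauses.append(f"{kind} '{uri}'")
    if "'" in class_name:
        raise ValueError("class name must not contain a single quote")

    head = "CREATE OR REPLACE FUNCTION" if replace else "CREATE FUNCTION"
    statement = f"{head} {name} AS '{class_name}'"
    if clauses:
        statement += " USING " + ", ".join(clauses)
    return statement


# Hypothetical values, for illustration only.
print(render_create_function(
    "my_transform",
    "com.example.MyTransform",
    [("jar", "dbfs:/tmp/udfs.jar")],
))
```

A CI/CD job could run the rendered statement against a cluster or SQL warehouse after each merge, so the function definition always follows the repo.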

References

[1] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-function.html

[2] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-sql-function.html#suppo...

IvanK · 09-20-2023

Hello @Retired_mod,
Thank you for the reply, but I do not think this will solve our problem. Let me give you some more background:

We are going to use the orchestration tool Prefect to start clusters in Databricks and then run dbt on those clusters to perform, for example, data transformations.
The transformations we will use are written in Python scripts (it is way too complex to write these transformations in SQL). The Python scripts will make use of other libraries, and we write/update our Python scripts in our Git repo (outside of Databricks).

As I understand it, the Databricks clusters we start must have an active SparkSession in order to register UDFs. An active SparkSession is only created when a notebook is attached to the cluster (correct me if I am wrong). We do not want to use notebooks, as that adds complexity.

Also, a UDF is registered only within a single SparkSession. This means you have to register the UDF again every time you start a cluster or create a new one, which is not convenient. (NOTE: Python UDFs defined in Unity Catalog are limited to the libraries in the Databricks Runtime, and since we need other libraries, defining Python UDFs in Unity Catalog is not an option for us.)
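For reference, the per-session registration we want to avoid looks roughly like the sketch below. It assumes PySpark's `spark.udf.register(name, f)` call; the helper name `register_udfs` and the function names passed to it are hypothetical.

```python
def register_udfs(spark, udfs):
    """Register each Python callable as a session-scoped UDF.

    spark: an active SparkSession (anything exposing udf.register works).
    udfs:  mapping of SQL function name -> Python callable.
    Returns the registered names in sorted order.
    """
    registered = []
    for name, fn in sorted(udfs.items()):
        # Visible only in this SparkSession; a new session on a freshly
        # started cluster does not have it until this is run again.
        spark.udf.register(name, fn)
        registered.append(name)
    return registered
```

With a real session this would be called as, for example, `register_udfs(spark, {"clean_text": clean_text})`, and it would have to be repeated on every cluster start, which is the step a permanent function would remove.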

So, to get around this problem, we would like to create a permanent function that can be used by any cluster, without the user or service principal having to register the function every time they start a cluster.

This way, we can add/update/remove functions without users having to remember that they have to register the functions before using them.

Is that possible to do?

Also, regarding your reply:
You wrote: "2. Read this Python file in Databricks". What does this mean? Where, and how do you do this?

Best regards,
IvanK