Data Engineering

CREATE FUNCTION from Python file

gbrueckl
Contributor II

Is it somehow possible to create an SQL external function using Python code?

The examples only show how to use JARs:

https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-functio...

something like:

CREATE TEMPORARY FUNCTION simple_temp_udf AS 'SimpleUdf' USING FILE '/tmp/SimpleUdf.py';


6 REPLIES

-werners-
Esteemed Contributor III (Accepted Solution)

I would think the USING FILE clause would work, as long as you follow the class_name requirements.

The implementing class should extend one of the base classes as follows:

  • Should extend UDF or UDAF in the org.apache.hadoop.hive.ql.exec package.
  • Should extend AbstractGenericUDAFResolver, GenericUDF, or GenericUDTF in the org.apache.hadoop.hive.ql.udf.generic package.
  • Should extend UserDefinedAggregateFunction in the org.apache.spark.sql.expressions package.

Also, the docs literally state that Python is possible:

In addition to the SQL interface, Spark allows you to create custom user defined scalar and aggregate functions using Scala, Python, and Java APIs. See User-defined scalar functions (UDFs) and User-defined aggregate functions (UDAFs) for more information.

So it should be possible; maybe your Python class does not meet the requirements?

Mumu
New Contributor II

For Python, which class should we extend then? All of the listed parent classes are Java.

-werners-
Esteemed Contributor III

For PySpark you can use udf().

Here is an example of how to do this.
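A minimal sketch of that approach, with illustrative names (the original linked example is not reproduced here):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Wrap an ordinary Python function as a Spark UDF.
def plus_one(x):
    return x + 1

plus_one_udf = udf(plus_one, IntegerType())

# Apply it to a DataFrame column.
spark.range(5).withColumn("y", plus_one_udf("id")).show()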

Mumu
New Contributor II

Thanks for your response. What I am looking for is to define a view with the UDF. However, a session-level UDF, as described in the example you provided, does not seem to allow that. Maybe I should clarify my question: I want to define an external UDF like the Hive ones.
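For what it's worth, the session-scoped behavior being discussed can be sketched like this (names are illustrative): register the Python function for SQL use with spark.udf.register, then reference it from a temporary view. As noted above, this does not survive the session, which is exactly the limitation with permanent views.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Register a Python function so SQL can call it (session-scoped only).
spark.udf.register("plus_one", lambda x: x + 1, IntegerType())

spark.range(5).createOrReplaceTempView("numbers")

# A TEMP view can use the UDF within this session; a permanent view
# would break once the session (and the registered UDF) is gone.
spark.sql("CREATE OR REPLACE TEMP VIEW numbers_plus AS SELECT plus_one(id) AS y FROM numbers")
spark.sql("SELECT * FROM numbers_plus").show()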

Anonymous
Not applicable

@Wugang Xu - My name is Piper, and I'm a moderator here for Databricks. Thanks for coming to us with your question. We'll give the members a bit longer to respond and come back if we need to. Thanks in advance for your patience. 🙂

pts
New Contributor II

As a user of your code, I'd find it a less pleasant API because I'd have to call some_module.some_func.some_func() rather than just some_module.some_func().

There's no reason to have "some_func" exist twice in the hierarchy; it's redundant. If some_func is so large that adding any more code to the file seems crazy, maybe some_func is too large and you should refactor and simplify it.

Having one file serve one purpose makes sense. Having it contain literally a single function and nothing else is pretty unusual. The two layouts are sketched below.
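To illustrate with hypothetical names:

# Layout being argued against: a module named after its only function.
# some_module/some_func.py contains: def some_func(): ...
from some_module import some_func
some_func.some_func()    # the name appears twice at the call site

# Flatter alternative: define the function directly in the module.
# some_module.py contains: def some_func(): ...
import some_module
some_module.some_func()  # single, unambiguous name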
