Data Engineering

CREATE FUNCTION from Python file

gbrueckl
Contributor II

Is it somehow possible to create an SQL external function using Python code?

The examples only show how to use JARs:

https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-functio...

something like:

CREATE TEMPORARY FUNCTION simple_temp_udf AS 'SimpleUdf' USING FILE '/tmp/SimpleUdf.py';


6 REPLIES

-werners-
Esteemed Contributor III (Accepted Solution)

I would think the USING FILE clause would work, as long as you follow the class_name requirements.

The implementing class should extend one of the base classes as follows:

  • Should extend UDF or UDAF in the org.apache.hadoop.hive.ql.exec package.
  • Should extend AbstractGenericUDAFResolver, GenericUDF, or GenericUDTF in the org.apache.hadoop.hive.ql.udf.generic package.
  • Should extend UserDefinedAggregateFunction in the org.apache.spark.sql.expressions package.

Also, the docs literally state that Python is possible:

In addition to the SQL interface, Spark allows you to create custom user defined scalar and aggregate functions using Scala, Python, and Java APIs. See User-defined scalar functions (UDFs) and User-defined aggregate functions (UDAFs) for more information.

So it should be possible; maybe your Python class does not meet the requirements?

Mumu
New Contributor II

For Python, which class should we extend then? All of the listed parent classes are Java.

-werners-
Esteemed Contributor III

For PySpark you can use udf().

Here is an example of how to do this.
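A minimal sketch of that approach, with illustrative names (the original linked example is not reproduced here):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Wrap an ordinary Python function as a Spark UDF.
def plus_one(x):
    return x + 1

plus_one_udf = udf(plus_one, IntegerType())

# Apply it to a DataFrame column.
spark.range(5).withColumn("y", plus_one_udf("id")).show()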

Mumu
New Contributor II

Thanks for your response. What I am looking for is to define a view with the UDF. However, a session-level UDF, as described in the example you provided, does not seem to allow that. Maybe I should clarify my question: I want to define an external UDF like the Hive ones.
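For what it's worth, the session-scoped behavior being discussed can be sketched like this (names are illustrative): register the Python function for SQL use with spark.udf.register, then reference it from a temporary view. As noted above, this does not survive the session, which is exactly the limitation with permanent views.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Register a Python function so SQL can call it (session-scoped only).
spark.udf.register("plus_one", lambda x: x + 1, IntegerType())

spark.range(5).createOrReplaceTempView("numbers")

# A TEMP view can use the UDF within this session; a permanent view
# would break once the session (and the registered UDF) is gone.
spark.sql("CREATE OR REPLACE TEMP VIEW numbers_plus AS SELECT plus_one(id) AS y FROM numbers")
spark.sql("SELECT * FROM numbers_plus").show()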

Anonymous
Not applicable

@Wugang Xu - My name is Piper, and I'm a moderator here for Databricks. Thanks for coming to us with your question. We'll give the members a bit longer to respond and come back if we need to. Thanks in advance for your patience. 🙂

pts
New Contributor II

As a user of your code, I'd find it a less pleasant API because I'd have to call some_module.some_func.some_func() rather than just some_module.some_func().

There's no reason to have "some_func" exist twice in the hierarchy; it's redundant. If some_func is so large that adding any more code to the file seems crazy, maybe some_func is too large and you should refactor and simplify it.

Having one file serve one purpose makes sense. Having it contain literally a single function and nothing else is pretty unusual. The two layouts are sketched below.
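To illustrate with hypothetical names:

# Layout being argued against: a module named after its only function.
# some_module/some_func.py contains: def some_func(): ...
from some_module import some_func
some_func.some_func()    # the name appears twice at the call site

# Flatter alternative: define the function directly in the module.
# some_module.py contains: def some_func(): ...
import some_module
some_module.some_func()  # single, unambiguous name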
