Data analysis often involves complex transformations and calculations that can be difficult to read and hard to maintain in SQL. For example, financial analysts may need to apply dynamic interest and exchange rates, data engineers face challenges in validating data formats, and supply chain analysts may have to compute the distance between geospatial locations. Manually implementing these calculations and transformations in every SQL query is cumbersome and error-prone.
Custom functions in Databricks SQL provide an elegant solution to wrap complex logic into reusable modules in SQL and Python. By creating custom user-defined functions (UDFs), you can simplify queries, improve readability, and enhance maintainability. 

In this blog, we’ll illustrate the benefits and applications of custom functions in Databricks SQL by using the practical example of geospatial distance calculation.

SQL Scalar Functions

Let's consider the example of calculating the geospatial distance between two coordinates, such as Munich and Berlin. To reuse these coordinates throughout this post, we’ll declare them as variables:

-- Munich
DECLARE VAR MUC_LAT DECIMAL(6, 4) = 48.1351;
DECLARE VAR MUC_LNG DECIMAL(6, 4) = 11.5755;

-- Berlin
DECLARE VAR BER_LAT DECIMAL(6, 4) = 52.5200;
DECLARE VAR BER_LNG DECIMAL(6, 4) = 13.4050;

One way to calculate the distance between two points on the surface of a sphere (like the Earth) is by using the Spherical Law of Cosines:

$d = \arccos\big(\sin(\phi_1) \cdot \sin(\phi_2) + \cos(\phi_1) \cdot \cos(\phi_2) \cdot \cos(\lambda_2 - \lambda_1)\big) \cdot R$

Where:

  • $d$ is the distance between the two points.
  • $\phi_1$ and $\phi_2$ are the latitudes of the two points in radians.
  • $\lambda_1$ and $\lambda_2$ are the longitudes of the two points in radians.
  • $R$ is the radius of the sphere (approximately 6371 km for the Earth).

In Databricks SQL, this calculation can be done using standard SQL functions. Note the use of the radians function to convert the coordinates to radians:

-- Calculation using spherical law of cosines
SELECT
  (
     6371 -- radius of the earth in km
     * acos(
       sin(radians(MUC_LAT)) * sin(radians(BER_LAT))
       + cos(radians(MUC_LAT)) * cos(radians(BER_LAT)) * cos(radians(MUC_LNG) - radians(BER_LNG))
     )
  ) AS distance

The query returns a result of approximately 504.53 km (see the Google Maps route for comparison). Now, let's assume this calculation is repeated across the organization. Instead of rewriting or copying it again and again, we can define a reusable function using Databricks SQL syntax.

Scalar functions are ideal for row-by-row calculations and are designed to take one or more inputs and return a single value of a specific data type. Here's how we can define a UDF for calculating the spherical distance:

CREATE FUNCTION spherical_distance(
 lat1 DECIMAL(6, 4),
 lng1 DECIMAL(6, 4),
 lat2 DECIMAL(6, 4),
 lng2 DECIMAL(6, 4)
)
RETURNS DECIMAL(12, 2)
COMMENT 'Calculates the spherical distance between point 1 (lat1, lng1) and point 2 (lat2, lng2) on earth.'
RETURN (
 6371 -- radius of the earth in km
 * acos(
   sin(radians(lat1)) * sin(radians(lat2))
   + cos(radians(lat1)) * cos(radians(lat2)) * cos(radians(lng1) - radians(lng2))
 )
);

SELECT spherical_distance(MUC_LAT, MUC_LNG, BER_LAT, BER_LNG) AS distance;

The query for calculating the distance between Munich and Berlin is now a lot more readable and maintainable.

Once the UDF is created, it is automatically registered in Unity Catalog and can be centrally managed. This allows you to grant individual permissions to users and groups by assigning or revoking EXECUTE and MANAGE privileges.
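
For example, granting an analyst group the right to call the function might look like the following. The analysts group and the main.default location are placeholders; adjust them to your environment:

-- Hypothetical principal and location; adapt to your workspace
GRANT EXECUTE ON FUNCTION main.default.spherical_distance TO `analysts`;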

SQL Table-valued Functions

Unlike regular functions, which are designed to operate on a row level, table-valued functions return a set of rows as a table reference for a given input. One simple example of a table-valued function is the range function. The SQL statement below returns a table with a single column, id, and ten rows with values ranging from 0 to 9.

SELECT * FROM range(10);
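
The range function also accepts explicit start and end arguments, where the end bound is exclusive. For example, to produce the values 1 through 10 instead:

SELECT * FROM range(1, 11);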

In Databricks SQL, UDFs can also be defined as table-valued functions. To demonstrate this, let's stick with the calculation of the spherical distance. However, in this case, let's consider the – admittedly artificial – example of having to calculate the distance between Munich and a number of evenly spaced coordinates on the equator.

We can achieve this in two steps using UDFs:

  1. Create a table-valued function that will return a given number of evenly spaced coordinates on the equator and persist the output of that function to a table.
  2. Leverage our previously defined function spherical_distance to calculate the distance for every coordinate in the table in a single SELECT statement.

For the first step, we define a function taking a single parameter n, representing the number of coordinates to distribute evenly along the equator. We declare this function as a table-valued function using the RETURNS TABLE syntax and define three output columns: coordinate ID, latitude, and longitude. In the function body, we leverage the explode and sequence functions to create the desired output. In this example we set n = 40075, which corresponds to the earth’s circumference in kilometers, resulting in 40,075 records: one coordinate for every kilometer along the equator.

CREATE FUNCTION evenly_spaced_equator_coordinates(n INT)
RETURNS TABLE (COORDINATE_ID INT, LAT DECIMAL(7, 4), LNG DECIMAL(7, 4))
COMMENT 'Distributes n points evenly along the equator and returns a table with the coordinates of those points.'
RETURN
 SELECT
   EXPLODE(SEQUENCE(0, n - 1)) AS COORDINATE_ID,
   0 AS LAT,
   COORDINATE_ID * 360 / n AS LNG;

SELECT * FROM evenly_spaced_equator_coordinates(40075); -- circumference of the earth in km -> one coordinate every km

Using a simple CREATE TABLE AS SELECT (CTAS) statement, we can easily persist the output to a table:

CREATE TABLE equator_coordinates AS
 SELECT * FROM evenly_spaced_equator_coordinates(40075);
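
A quick sanity check confirms that the table contains the expected 40,075 rows:

SELECT COUNT(*) FROM equator_coordinates; -- returns 40075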

With the equator coordinates table in place, the final step is a simple SELECT statement using our spherical_distance function.

SELECT
  COORDINATE_ID,
  LAT,
  LNG,
  spherical_distance(MUC_LAT, MUC_LNG, LAT, LNG) AS DISTANCE_TO_MUC
FROM
  equator_coordinates;
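
From here, standard SQL applies as usual. For example, a hypothetical follow-up query to find the equator coordinate closest to Munich:

SELECT
  COORDINATE_ID,
  LNG,
  spherical_distance(MUC_LAT, MUC_LNG, LAT, LNG) AS DISTANCE_TO_MUC
FROM
  equator_coordinates
ORDER BY
  DISTANCE_TO_MUC
LIMIT 1;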

SQL Python UDFs

While we’ve demonstrated the useful capabilities of SQL scalar and table-valued functions, there are scenarios where SQL alone is insufficient or impractical. This is particularly the case when you want to implement logic that is more easily expressed in Python, or when the desired functionality already exists in a Python library. For such scenarios, Python UDFs in Databricks SQL offer a seamless way to integrate Python capabilities directly into your SQL queries. Python UDFs support importing Python libraries and implementing your custom logic in Python.
As in our previous examples, let’s consider the distance calculation between two geospatial coordinates.

In a Python UDF, you can simply import the “haversine_distances” function from the scikit-learn library, thereby eliminating the need to implement the formula manually. Here is how you can define such a Python UDF in Databricks SQL:

CREATE OR REPLACE FUNCTION haversine_udf(
 lat1 DECIMAL(6, 4),
 lng1 DECIMAL(6, 4),
 lat2 DECIMAL(6, 4),
 lng2 DECIMAL(6, 4)
) RETURNS DECIMAL(12, 2) LANGUAGE PYTHON AS $$

"""
Parameters:

haversine_udf (version 1.0):
- lat1 (decimal): Latitude of first point in degrees.
- lng1 (decimal): Longitude of first point in degrees.
- lat2 (decimal): Latitude of second point in degrees.
- lng2 (decimal): Longitude of second point in degrees.

Returns:
- decimal: The calculated distance between the two points in km.

Example Usage:
SELECT haversine_udf(MUC_LAT, MUC_LNG, BER_LAT, BER_LNG) AS distance;

Change Log:
- 1.0: Initial version.

"""
from sklearn.metrics.pairwise import haversine_distances
import math
from decimal import Decimal

# Convert coordinates to radians
point1 = [math.radians(lat1), math.radians(lng1)]
point2 = [math.radians(lat2), math.radians(lng2)]

# Calculate Haversine distances
distances = haversine_distances([point1, point2])

# Convert to kilometers
earth_radius_km = 6371
distance_km = distances[0, 1] * earth_radius_km

return Decimal(str(round(distance_km, 2)))
$$

Using this Python UDF is straightforward, as previously demonstrated with the scalar and table functions: 

SELECT haversine_udf(MUC_LAT, MUC_LNG, BER_LAT, BER_LNG) AS distance;

As we can see, the Python implementation returns the same distance as the scalar function implementation: 504.53 km.

Just like scalar and table-valued functions, Python UDFs can be managed and governed in Unity Catalog, allowing you to grant access to specific users and groups.

Performance considerations

Despite offering a useful option for expressing complex logic, Python UDFs are not as performant or scalable as native SQL in Databricks SQL. Because Python UDFs require data to be serialized and moved out of the JVM to the Python interpreter and back, they introduce additional overhead and miss out on Spark-level optimizations, making them increasingly less performant as data volumes grow.
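
Since both implementations return the same result here, one pragmatic approach is to validate the Python variant against the SQL variant once, and then prefer the SQL UDF for large scans. A minimal equivalence check using the two functions defined above:

SELECT
  spherical_distance(MUC_LAT, MUC_LNG, BER_LAT, BER_LNG) AS sql_udf_distance,
  haversine_udf(MUC_LAT, MUC_LNG, BER_LAT, BER_LNG) AS python_udf_distance;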

Limitations

Currently, only a subset of Python libraries is available for Databricks SQL Python UDFs, including popular libraries such as pandas, numpy, and scikit-learn. You can find the full list of available libraries in the Databricks documentation.

Conclusion

In scenarios where no built-in functions are available, custom SQL functions in Databricks SQL offer a powerful option for wrapping complex logic and improving the readability and reusability of your queries. Among other use cases, leveraging these functions can help you simplify recurring calculations such as geospatial distances, currency conversions, or metric conversions, providing a consistent approach across your organization. While Python UDFs are not recommended for performance-critical queries on larger datasets, they are a useful tool for small to medium-sized datasets.

Call to action

Start experimenting with your first custom SQL functions and Python UDFs in Databricks SQL today, think about potential use cases, and help your organization make complex transformations reusable and maintainable.

References

Custom DBSQL functions documentation

Python UDFs in Databricks SQL

Limitations of UDFs

Best Practices for UDFs