In the world of data engineering, the "What" and "When" are often handled with ease. We know what was bought and when it was delivered. But for industries like logistics, retail, and telecommunications, the most critical question is often "Where?"
Historically, processing spatial data at scale has been a "bolt-on" experience. Data engineers had to juggle external libraries like Apache Sedona, manage complex UDFs, or deal with the overhead of manual spatial indexing. These workarounds led to fragmented pipelines and performance bottlenecks.
Databricks Lakeflow SDP (Spark Declarative Pipelines) now bridges that gap. Native spatial types and functions mean your data engineers can write geospatial logic in the same data pipelines they already run; no special tools, no additional libraries, and no context switching.
Databricks Lakeflow was built to simplify the entire data lifecycle. By bringing spatial support into Lakeflow SDP, we’ve eliminated the friction of moving between specialized GIS tools and your production ETL.
Most spatial pipelines store coordinates as doubles, strings, or binary, and hope for the best. With the recent update to Lakeflow SDP, storing and processing spatial data has changed at the engine level. Lakeflow SDP now uses native column types that carry bounding box and spatial reference metadata with every value. That metadata enables data skipping and spatial joins that are significantly faster than third-party libraries. Lakeflow SDP supports:
Once your columns are spatial types, you operate on them with ST_ functions, the same function family used across PostGIS and other spatial databases, so there's nothing new to learn. Databricks ships over 90 of these across various categories: constructors (ST_Point, ST_GeomFromWKT), predicates (ST_Intersects, ST_Contains), measurements (ST_Area, ST_Distance, ST_Length), and more.
The key shift: these functions run inside your Lakeflow SDP pipelines, not in a sidecar library or a post-processing notebook. You ingest, transform, and query spatial data in the same declarative SQL you already use for everything else.
Overview of the 90+ spatial SQL functions available in Databricks.
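To give a flavor of the family, here is a minimal, illustrative snippet combining a constructor, a measurement, and a predicate. The coordinates and the WKT polygon are made up for the example, and exact argument forms may differ slightly between functions:

SELECT
  -- Constructor + measurement: distance between two points (in the units of the CRS)
  ST_Distance(
    ST_Point(-97.7431, 30.2672, 4326),  -- downtown Austin (illustrative)
    ST_Point(-97.6664, 30.1975, 4326)   -- Austin-Bergstrom airport (illustrative)
  ) AS distance,
  -- Constructor + predicate: does a rough bounding polygon contain the downtown point?
  ST_Contains(
    ST_GeomFromWKT('POLYGON((-97.9 30.1, -97.5 30.1, -97.5 30.5, -97.9 30.5, -97.9 30.1))', 4326),
    ST_Point(-97.7431, 30.2672, 4326)
  ) AS downtown_in_polygon;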
A common operation in geospatial engineering is the Spatial Join—for example, determining which delivery truck is currently inside which service zone. Databricks Spatial SQL (including Python and Scala bindings) outperforms traditional library-based approaches by leveraging native bounding box metadata and engine-level spatial indexing. Because these optimizations are built directly into the product, you can achieve high-performance out of the box without the headache of manual partitioning or managing external libraries. It’s all about making the heavy lifting automatic, so you can spend more time putting your location data to work.
Benchmark Chart showing improvement over a traditional library.
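As a concrete sketch of that pattern (the table and column names here are hypothetical), a point-in-polygon spatial join is expressed as an ordinary join with a spatial predicate, and the engine takes care of the indexing:

-- Hypothetical tables: truck_pings(truck_id, longitude, latitude), service_zones(zone_id, zone_geom)
SELECT
  p.truck_id,
  z.zone_id
FROM truck_pings p
JOIN service_zones z
  ON ST_Contains(z.zone_geom, ST_Point(p.longitude, p.latitude, 4326));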
To demonstrate these capabilities, we implemented a real-time courier delivery simulation. This demo tracks a fleet of 50 couriers across Austin, TX, navigating a road network derived from Overture Maps. The simulation can be configured to work with other named regions and to control various properties, including how much chaos to introduce. The simulation that emits the events is driven from the austin_delivery_sim notebook.
import json
from dataclasses import fields

# OvertureSource, SimConfig, SimulationOutput, and SimulationManager come from the demo repo;
# OVERTURE_DIR and BBOX are defined elsewhere in the austin_delivery_sim notebook.
source = OvertureSource.local(OVERTURE_DIR)
_sim_names = {f.name for f in fields(SimConfig)}  # valid SimConfig field names
_cfg = dict(
overture=source,
output=SimulationOutput(output_path=OVERTURE_DIR),
bounds=BBOX,
area_name="Austin, TX",
region="US-TX",
# Fleet
num_couriers=50,
assign_closest_courier=True, # False => random among available (ignores mode)
assign_courier_mode="random", # closest | round_robin | random
unavailable_after_delivery_sec=5.0,
# Routing
speed_km_h=73.0, # (default - this is 45 mph equivalent)
use_osmnx=True, # OSMnx planned_route per assignment (default)
assignment_use_osmnx=True,
turn_penalty_sec=0.0, # to speed up the demo (default is 10 sec)
# GPS
gps_interval_sec=10.0,
perturbation_meters=10.0,
# Dispatch
delivery_frequency_sec=0.5,
assignments_per_dispatch_wave=0,
min_assignment_separation_m=15.0,
# Chaos
delay_rate_target_pct=4.0, # ~4% slow bucket (0.48–0.72 factor)
max_abandonments_per_tick=1,
progress_factor_std=0.11, # tight normal spread
slow_progress_factor_min=0.48,
slow_progress_factor_max=0.72,
abandon_check_after_sec=60.0,
abandon_check_route_frac=0.38,
abandonment_threshold_pct=30.0, # abandon at 1.3× expected time
abandon_hold_sec=18.0,
assignment_route_workers=6, # parallel route solves per wave
arrival_time_threshold_pct=10.0,
delivery_delayed_interval_sec=30.0,
delivery_early_interval_sec=30.0,
# Runtime
sim_duration_sec=1200.0, # 20-minute live run
clear_output_on_run=True,
)
config = SimConfig(**{k: _cfg[k] for k in _cfg if k in _sim_names})
mgr = SimulationManager(config, spark=spark)
summary = mgr.initialize()
print(json.dumps(summary, indent=2))
The simulation emits raw JSON events (GPS pings, assignments, and completions) into a Unity Catalog Volume. Lakeflow SDP then picks up these events using Auto Loader to build a live spatial gold layer.
Courier Deliveries Lakeflow data pipeline (click for higher res).
Here is the Lakeflow SDP Streaming Table that ingests the raw GPS pings from the courier delivery simulation. The table reads JSON files continuously from the delivery_status/gps/ directory on a Unity Catalog Volume. Each row is one ping from a courier en route to a delivery, emitted every 10 seconds (default gps_interval_sec). The source path is configured via the pipeline parameter event_base_path, e.g. '/Volumes/<catalog>/<schema>/<volume>/overture/austin_default/delivery_status'.
The JSON fields (from the GpsEvent dataclass) mirror the schema declared below: event_type, courier_id, delivery_id, longitude, latitude, and timestamp_iso.
CREATE OR REFRESH STREAMING TABLE gps_event
COMMENT "Live GPS position events from the courier delivery simulation. Each row is one GPS ping per courier per gps_interval_sec."
AS
SELECT
event_type,
courier_id,
delivery_id,
longitude,
latitude,
ST_Point(longitude, latitude, 4326) AS point,
timestamp_iso,
CAST(timestamp_iso AS TIMESTAMP) AS event_ts
FROM STREAM read_files(
'${event_base_path}/gps/',
format => 'json',
schema => 'event_type STRING, courier_id STRING, delivery_id STRING, longitude DOUBLE, latitude DOUBLE, timestamp_iso STRING',
includeExistingFiles => true
)
Lakeflow SDP Materialized Views are used to visualize 100 m and 1 km geofences around each delivery POI (via ST_Buffer), as well as to maintain active routes (see the repo for more).
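As a rough sketch of what such a view can look like (the view, table, and column names below are assumptions, and the real DDL lives in the repo), the buffers can be built by transforming each POI point into a meter-based projection before calling ST_Buffer, since buffer distances are expressed in the units of the geometry's CRS. In production a locally accurate projection would be preferable to Web Mercator:

CREATE OR REFRESH MATERIALIZED VIEW poi_geofence
COMMENT "Illustrative sketch: 100m and 1km buffers around each delivery POI."
AS
SELECT
  poi_id,
  -- Transform the lon/lat point to a meter-based projection (Web Mercator here) so the
  -- buffer distances below are in meters rather than degrees.
  ST_Buffer(ST_Transform(ST_Point(poi_lon, poi_lat, 4326), 3857), 100)  AS geofence_100m,
  ST_Buffer(ST_Transform(ST_Point(poi_lon, poi_lat, 4326), 3857), 1000) AS geofence_1km
FROM poi;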
Data engineering doesn't end at the table. To close the loop, we deployed a Databricks App that queries these Lakeflow Materialized Views. Using Pydeck, the app visualizes the road network (Overture Maps), the active courier routes, and the real-time geofence status—refreshing every 15 seconds as the pipeline processes new pings.
The app provides global stats as the simulation progresses, making it easy to see how many deliveries are in flight, completed, or abandoned. Per-minute windowed throughput is calculated for both deliveries and GPS pings, and POIs and couriers can be post-filtered for more granular tracking or for triaging delivery issues.
Databricks App UI - Couriers moving along their assigned routes towards delivery POIs (click for higher res).
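The per-minute throughput numbers, for instance, reduce to a simple windowed aggregation over the streaming table (a minimal sketch; the app's actual query in queries.py may differ):

-- Count GPS pings per one-minute window, most recent first.
SELECT
  window(event_ts, '1 minute') AS minute_window,
  COUNT(*)                     AS gps_pings
FROM gps_event
GROUP BY window(event_ts, '1 minute')
ORDER BY minute_window DESC;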
A key requirement for logistics is knowing when a courier is approaching a Point of Interest (POI). We use ST_DWithin for the actual distance tests between GPS pings and POIs (see queries.py). Here is a culminating query used by the app to report the latest GPS ping for each active courier and whether it falls within 1 km or 100 m of the target POI. Note: the queries are composed in Python, so some variables are injected into the SQL text.
WITH latest_gps AS (
SELECT
courier_id,
delivery_id,
longitude,
latitude,
event_ts,
ROW_NUMBER() OVER (PARTITION BY courier_id ORDER BY event_ts DESC) AS rn
FROM {_table("gps_event")}
WHERE {_vp_g}
),
active AS (
SELECT courier_id, poi_id, target_lon, target_lat FROM {_table("active_route")}
),
in_1km AS (
SELECT DISTINCT g.courier_id
FROM latest_gps g
JOIN active ar ON g.courier_id = ar.courier_id
WHERE g.rn = 1
AND ST_DWithin(
ST_Point(ar.target_lon, ar.target_lat, 4326),
ST_Point(g.longitude, g.latitude, 4326),
{d1}
)
),
in_100m AS (
SELECT DISTINCT g.courier_id
FROM latest_gps g
JOIN active ar ON g.courier_id = ar.courier_id
WHERE g.rn = 1
AND ST_DWithin(
ST_Point(ar.target_lon, ar.target_lat, 4326),
ST_Point(g.longitude, g.latitude, 4326),
{d100}
)
)
SELECT
g.courier_id,
g.delivery_id,
g.longitude,
g.latitude,
g.event_ts,
CAST(DATEDIFF(MINUTE, g.event_ts, current_timestamp()) AS INT) AS age_minutes,
(i1.courier_id IS NOT NULL) AS in_1km_geofence,
(i100.courier_id IS NOT NULL) AS in_100m_geofence,
(ar.courier_id IS NOT NULL) AS has_active_route
FROM latest_gps g
LEFT JOIN active ar ON g.courier_id = ar.courier_id
LEFT JOIN in_1km i1 ON g.courier_id = i1.courier_id
LEFT JOIN in_100m i100 ON g.courier_id = i100.courier_id
WHERE g.rn = 1;
By unifying spatial intelligence with the Lakeflow engine, we are removing the "spatial tax" that has traditionally slowed down innovation. Whether you are calculating flood risks, optimizing delivery routes, or analyzing cell tower coverage, Lakeflow SDP provides the scale and simplicity needed to turn raw coordinates into actionable business insights.