Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
mjohns
Databricks Employee

Introduction: Modern Data Engineering has a Location Problem

In the world of data engineering, the "What" and "When" are often handled with ease. We know what was bought and when it was delivered. But for industries like logistics, retail, and telecommunications, the most critical question is often "Where?"

Historically, processing spatial data at scale has been a "bolt-on" experience. Data engineers had to juggle external libraries like Apache Sedona, manage complex UDFs, or deal with the overhead of manual spatial indexing. These silos led to fragmented pipelines and performance bottlenecks.

Databricks Lakeflow SDP (Spark Declarative Pipelines) now bridges that gap. Native spatial types and functions mean your data engineers can write geospatial logic in the same data pipelines they already run: no special tools, no additional libraries, and no context switching.

The Lakeflow Advantage: Connect, Pipeline, and Orchestrate

Databricks Lakeflow was built to simplify the entire data lifecycle. By bringing spatial support into Lakeflow SDP, we’ve eliminated the friction of moving between specialized GIS tools and your production ETL.

  • Native Spatial Types: Support for GEOMETRY and GEOGRAPHY types means no more converting coordinates to strings or complex arrays.
  • Photon-Accelerated: Spatial predicates are vectorized in C++, allowing the engine to process millions of geometries per second.
  • Declarative Simplicity: You define what the spatial transformation is; Lakeflow SDP handles the how, including state management and incremental processing.

The Foundation: Native Spatial SQL & Data Types

Most spatial pipelines store coordinates as doubles, strings, or binary, and hope for the best. With the recent update to Lakeflow SDP, storing and processing spatial data has changed at the engine level. Lakeflow SDP now uses native column types that carry bounding box and spatial reference metadata with every value. That metadata enables data skipping and spatial joins that are significantly faster than third-party libraries. Lakeflow SDP supports:

  • GEOMETRY: For planar coordinate systems (X, Y), ideal for local maps.
  • GEOGRAPHY: For spherical coordinates (Longitude, Latitude) on the Earth’s surface, essential for global logistics.
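Declaring these columns is ordinary DDL. A minimal sketch (the table and column names here are illustrative, not from the demo):

```sql
-- Illustrative DDL: spatial columns declared natively, no string or binary workarounds
CREATE TABLE delivery_zones (
  zone_id   STRING,
  boundary  GEOMETRY(4326),   -- planar zone polygon
  centroid  GEOGRAPHY(4326)   -- longitude/latitude point on the sphere
);
```

The SRID (4326 here, i.e. WGS84) travels with the column, so downstream functions know how to interpret the coordinates.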

Once your columns are spatial types, you operate on them with ST_ functions, the same function family used across PostGIS and other spatial databases, so there's nothing new to learn. Databricks ships over 90 of these across various categories: constructors (ST_Point, ST_GeomFromWKT), predicates (ST_Intersects, ST_Contains), measurements (ST_Area, ST_Distance, ST_Length), and more.
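To give a flavor of how the families compose, here is a small sketch combining a constructor, a measurement, and a predicate (values are illustrative):

```sql
-- Construct, measure, and test in a single SELECT
SELECT
  ST_Point(-97.7431, 30.2672, 4326)                AS downtown_austin,
  ST_Distance(ST_Point(0, 0), ST_Point(3, 4))      AS planar_dist,  -- Euclidean: 5.0
  ST_Contains(
    ST_GeomFromWKT('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'),
    ST_Point(5, 5)
  )                                                AS inside        -- true
```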

The key shift: these functions run inside your Lakeflow SDP pipelines, not in a sidecar library or a post-processing notebook. You ingest, transform, and query spatial data in the same declarative SQL you already use for everything else.

Overview of the 90+ spatial SQL functions available in Databricks.

Performance Spotlight: Fast Spatial Joins

A common operation in geospatial engineering is the Spatial Join—for example, determining which delivery truck is currently inside which service zone. Databricks Spatial SQL (including Python and Scala bindings) outperforms traditional library-based approaches by leveraging native bounding box metadata and engine-level spatial indexing. Because these optimizations are built directly into the product, you achieve high performance out of the box without the headache of manual partitioning or managing external libraries. It’s all about making the heavy lifting automatic, so you can spend more time putting your location data to work.
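In SQL, the truck-in-zone example reduces to a join on a spatial predicate. A sketch, with illustrative table names:

```sql
-- Which truck is currently inside which service zone?
SELECT t.truck_id, z.zone_id
FROM trucks t
JOIN service_zones z
  ON ST_Contains(z.boundary, t.position)
```

The engine can use the bounding-box metadata on both columns to prune candidate pairs before evaluating the exact predicate.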

Benchmark chart showing improvement over a traditional library.

Real-World Implementation

1. Simulation: Austin Couriers

To demonstrate these capabilities, we implemented a real-time courier delivery simulation. This demo tracks a fleet of 50 couriers across Austin, TX, navigating a road network derived from Overture Maps. The simulation can be configured to work with other named regions and to control various properties, including how much chaos to introduce. The simulation that emits the events is driven from the austin_delivery_sim notebook.

# json and dataclasses.fields are stdlib; the sim classes ship with the demo repo
import json
from dataclasses import fields

source = OvertureSource.local(OVERTURE_DIR)
_sim_names = {f.name for f in fields(SimConfig)}
_cfg = dict(
   overture=source,
   output=SimulationOutput(output_path=OVERTURE_DIR),
   bounds=BBOX,
   area_name="Austin, TX",
   region="US-TX",
   # Fleet
   num_couriers=50,
   assign_closest_courier=True,  # False => random among available (ignores mode)
   assign_courier_mode="random", # closest | round_robin | random
   unavailable_after_delivery_sec=5.0,
   # Routing
   speed_km_h=73.0,      # (default - this is 45 mph equivalent)
   use_osmnx=True,       # OSMnx planned_route per assignment (default)
   assignment_use_osmnx=True,
   turn_penalty_sec=0.0, # to speed up the demo (default is 10 sec)
   # GPS
   gps_interval_sec=10.0,
   perturbation_meters=10.0,
   # Dispatch
   delivery_frequency_sec=0.5,
   assignments_per_dispatch_wave=0,
   min_assignment_separation_m=15.0,
   # Chaos
   delay_rate_target_pct=4.0,      # ~4% slow bucket (0.48-0.72 factor)
   max_abandonments_per_tick=1,
   progress_factor_std=0.11,       # tight normal spread
   slow_progress_factor_min=0.48,
   slow_progress_factor_max=0.72,
   abandon_check_after_sec=60.0,
   abandon_check_route_frac=0.38,
   abandonment_threshold_pct=30.0, # abandon at 1.3x expected
   abandon_hold_sec=18.0,
   assignment_route_workers=6,     # parallel route solves per wave
   arrival_time_threshold_pct=10.0,
   delivery_delayed_interval_sec=30.0,
   delivery_early_interval_sec=30.0,
   # Runtime
   sim_duration_sec=1200.0,        # 20-minute live run
   clear_output_on_run=True,
)
config = SimConfig(**{k: _cfg[k] for k in _cfg if k in _sim_names})
mgr = SimulationManager(config, spark=spark)
summary = mgr.initialize()
print(json.dumps(summary, indent=2))

2. Lakeflow SDP: Ingestion with Native Types

The simulation emits raw JSON events (GPS pings, assignments, and completions) into a Unity Catalog Volume. Lakeflow SDP then picks up these events using Auto Loader to build a live spatial gold layer.

Courier Deliveries Lakeflow data pipeline (click for higher res).

Here is the Lakeflow SDP Streaming Table that ingests the raw GPS pings from the courier delivery simulation. The table reads JSON files continuously from the delivery_status/gps/ directory on a Unity Catalog Volume. Each row is one ping from a courier en route to a delivery, emitted every 10 seconds (default gps_interval_sec). The source path is configured via the pipeline parameter event_base_path, e.g. '/Volumes/<catalog>/<schema>/<volume>/overture/austin_default/delivery_status'.

The JSON fields (from GpsEvent dataclass):

  • event_type STRING — always "gps"
  • courier_id STRING — unique courier identifier
  • delivery_id STRING — active delivery identifier
  • longitude DOUBLE — perturbed WGS84 longitude
  • latitude DOUBLE — perturbed WGS84 latitude
  • timestamp_iso STRING — ISO-8601 simulation timestamp
CREATE OR REFRESH STREAMING TABLE gps_event
COMMENT "Live GPS position events from the courier delivery simulation. Each row is one GPS ping per courier per gps_interval_sec."
AS
SELECT
  event_type,
  courier_id,
  delivery_id,
  longitude,
  latitude,
  ST_Point(longitude, latitude, 4326)    AS point,
  timestamp_iso,
  CAST(timestamp_iso AS TIMESTAMP) AS event_ts
FROM STREAM read_files(
  '${event_base_path}/gps/',
  format                        => 'json',
  schema                        => 'event_type STRING, courier_id STRING, delivery_id STRING, longitude DOUBLE, latitude DOUBLE, timestamp_iso STRING',
  includeExistingFiles          => true
)

Lakeflow SDP Materialized Views are used to visualize 100m and 1km geofences around each delivery POI (via ST_Buffer), as well as to maintain active routes (see the repo for more).
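One of those geofence views might be sketched along these lines (illustrative only; the real definitions live in the repo, and buffer units depend on the coordinate reference system in use):

```sql
-- Sketch: geofence ring around each delivery POI
CREATE OR REFRESH MATERIALIZED VIEW poi_geofence_100m AS
SELECT
  poi_id,
  ST_Buffer(ST_Point(longitude, latitude, 4326), 100) AS fence_100m
FROM delivery_poi;
```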

3. Databricks Apps: From Pipelines to Insights

Data engineering doesn't end at the table. To close the loop, we deployed a Databricks App that queries these Lakeflow Materialized Views. Using Pydeck, the app visualizes the road network (Overture Maps), the active courier routes, and the real-time geofence status—refreshing every 15 seconds as the pipeline processes new pings.

The app provides global stats as the simulation progresses, making it easy to measure in-flight deliveries and how many have been completed or abandoned. Per-minute windowed throughput is calculated for deliveries and GPS pings. POIs and couriers can also be post-filtered for more granular tracking or triage of delivery issues.

Databricks App UI - Couriers moving along their assigned routes towards delivery POIs (click for higher res).

A key requirement for logistics is knowing when a courier is approaching a Point of Interest (POI). We use ST_DWithin for the actual distance tests between GPS pings and POIs; see queries.py. Here is a culminating query used by the app to report each active courier's latest GPS ping and whether they are within 1km or 100m of their target POI. Note: queries are composed in Python, so some variables are injected at runtime.

WITH latest_gps AS (
  SELECT
    courier_id,
    delivery_id,
    longitude,
    latitude,
    event_ts,
    ROW_NUMBER() OVER (PARTITION BY courier_id ORDER BY event_ts DESC) AS rn
  FROM {_table("gps_event")}
  WHERE {_vp_g}
),
active AS (
  SELECT courier_id, poi_id, target_lon, target_lat FROM {_table("active_route")}
),
in_1km AS (
  SELECT DISTINCT g.courier_id
  FROM latest_gps g
  JOIN active ar ON g.courier_id = ar.courier_id
  WHERE g.rn = 1
    AND ST_DWithin(
      ST_Point(ar.target_lon, ar.target_lat, 4326),
      ST_Point(g.longitude, g.latitude, 4326),
      {d1}
    )
),
in_100m AS (
  SELECT DISTINCT g.courier_id
  FROM latest_gps g
  JOIN active ar ON g.courier_id = ar.courier_id
  WHERE g.rn = 1
    AND ST_DWithin(
      ST_Point(ar.target_lon, ar.target_lat, 4326),
      ST_Point(g.longitude, g.latitude, 4326),
      {d100}
    )
)
SELECT
  g.courier_id,
  g.delivery_id,
  g.longitude,
  g.latitude,
  g.event_ts,
  CAST(DATEDIFF(MINUTE, g.event_ts, current_timestamp()) AS INT) AS age_minutes,
  (i1.courier_id   IS NOT NULL) AS in_1km_geofence,
  (i100.courier_id IS NOT NULL) AS in_100m_geofence,
  (ar.courier_id   IS NOT NULL) AS has_active_route
FROM latest_gps g
LEFT JOIN active  ar   ON g.courier_id = ar.courier_id
LEFT JOIN in_1km  i1   ON g.courier_id = i1.courier_id
LEFT JOIN in_100m i100 ON g.courier_id = i100.courier_id
WHERE g.rn = 1;
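The latest_gps CTE's dedup pattern (ROW_NUMBER over courier_id, keep rn = 1) is worth internalizing, since it recurs in telemetry pipelines. A plain-Python sketch of the same logic, using field names from the GpsEvent JSON above:

```python
from datetime import datetime

def latest_ping_per_courier(pings):
    """Keep only the newest ping per courier_id, mirroring
    ROW_NUMBER() OVER (PARTITION BY courier_id ORDER BY event_ts DESC) ... rn = 1."""
    latest = {}
    for p in pings:
        ts = datetime.fromisoformat(p["timestamp_iso"])
        if p["courier_id"] not in latest or ts > latest[p["courier_id"]][0]:
            latest[p["courier_id"]] = (ts, p)
    return [p for _, p in latest.values()]

pings = [
    {"courier_id": "c1", "timestamp_iso": "2024-05-01T10:00:00"},
    {"courier_id": "c1", "timestamp_iso": "2024-05-01T10:00:10"},
    {"courier_id": "c2", "timestamp_iso": "2024-05-01T10:00:05"},
]
result = latest_ping_per_courier(pings)  # one row per courier, newest ping wins
```

In the SQL version the window function does this partition-and-rank in one pass, and the geofence CTEs then only test the rn = 1 rows against ST_DWithin.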

Architecting for the Future

By unifying spatial intelligence with the Lakeflow engine, we are removing the "spatial tax" that has traditionally slowed down innovation. Whether you are calculating flood risks, optimizing delivery routes, or analyzing cell tower coverage, Lakeflow SDP provides the scale and simplicity needed to turn raw coordinates into actionable business insights.

Ready to get started?
