In Part 1, we explored core concepts and datasets driving geospatial analytics in the automotive industry. In Part 2, we’ll dive into practical steps for building scalable geospatial pipelines using AI, ML, and synthetic data—while maintaining governance and performance on Databricks.
We’ll focus on real code and architecture patterns that bring these ideas to life in production-ready automotive and mobility solutions.
The Databricks Data Intelligence Platform combines powerful geospatial analytics and AI to deliver scalable, real-time insights. With features like Liquid Clustering and H3 spatial indexing, it enables fast and efficient processing of massive geospatial datasets. Built-in geospatial functions simplify spatial tasks such as mapping traffic patterns or assessing road risk. AutoML accelerates model development for use cases like predicting aggressive driving by factoring in weather, traffic, and road conditions. The platform also ensures strong governance through Unity Catalog (UC), which manages data access and sharing securely. Tools like ai_query and UC-governed functions make it easy to extract structured geolocation data from unstructured sources.
This post will focus on a full geospatial analytics pipeline built on the Databricks Data Intelligence Platform. Below we illustrate the medallion pipeline combining geospatial data, LLMs, and Genie for conversational insights.
Geospatial data ingestion at scale in Databricks is seamless thanks to the platform's integration with a wide range of geospatial libraries and tools. Databricks geospatial functions are specifically designed to enhance spatial data handling. Auto Loader is the go-to option for processing billions of files from cloud storage, while synthetic data generation can serve as an alternative during development.
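To make the ingestion pattern concrete, here is a minimal Auto Loader sketch; the storage paths and target table name are placeholders you would replace with your own.
raw_path = "s3://<bucket>/telematics/raw/"                    # hypothetical landing zone for raw telematics files
checkpoint_path = "s3://<bucket>/telematics/_checkpoints/bronze/"

# Auto Loader incrementally discovers new files in cloud storage and tracks schema in the checkpoint location
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
)

# Write the stream to a bronze Delta table; availableNow processes all pending files and then stops
(
    bronze_stream.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable("main.telematics.telematics_bronze")
)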
Telematics is a strong use case for synthetic data because it enables realistic testing and model development without exposing sensitive or personal vehicle information. While synthetic data can be built with custom SQL or Python logic, the Databricks Labs Data Generator (dbldatagen) library makes the process significantly easier. It provides a declarative interface for creating large, scalable synthetic datasets directly on Spark.
In the example below, we use dbldatagen to simulate 1 million rows of telematics data. This setup enables developers to generate realistic datasets for modeling and testing without relying on production data.
%pip install dbldatagen

from dbldatagen import DataGenerator
from pyspark.sql.functions import col, when, rand

# Simulate 1M rows of telematics signals: vehicle identifiers, GPS positions
# (bounding box roughly covering New York State), speed, and safety-system flags
data_gen = (
    DataGenerator(spark, rows=1000000, partitions=8)
    .withColumn("vehicle_id", "bigint", minValue=1, maxValue=1000000)
    .withColumn("vin_number", "string", template='A-HJ-NPR-Z0-9')  # VIN-style identifier
    .withColumn("timestamp", "timestamp", begin="2024-01-01 00:00:00", end="2024-12-31 23:59:59", interval="1 second")
    .withColumn("latitude", "double", minValue=40.477399, maxValue=45.01585)
    .withColumn("longitude", "double", minValue=-79.76259, maxValue=-71.7517)
    .withColumn("vehicle_speed", "double", minValue=0, maxValue=200)
    .withColumn("abs", "boolean", expr="floor(rand() * 2)")
    .withColumn("tpms", "boolean", expr="floor(rand() * 2)")
    .withColumn("brake_pad", "boolean", expr="floor(rand() * 2)")
    .withColumn("accelerometer", "boolean", expr="floor(rand() * 2)")
    .withColumn("blind_spot", "boolean", expr="floor(rand() * 2)")
    .withColumn("lane_departure", "boolean", expr="floor(rand() * 2)")
)

df_telematics = data_gen.build()
display(df_telematics)
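From here, the synthetic DataFrame can be persisted as the bronze layer of the medallion pipeline; the table name below is a placeholder.
# Persist the synthetic telematics data as a bronze Delta table (hypothetical table name)
df_telematics.write.mode("overwrite").saveAsTable("main.telematics.telematics_synthetic_bronze")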
Route generation enables optimized mobility, safety, and infrastructure planning by identifying efficient, risk-aware paths from geospatial data. In our pipeline, we reconstruct routes between pickup and drop-off points to correlate paths with external factors and gain deeper insight.
In Databricks, developers can use osmnx and networkx—open-source libraries that access OpenStreetMap data and compute optimal paths across street networks. The example below uses these tools with applyInPandas to parallelize routing across Spark executors. We also offer a Solution Accelerator for scalable route generation using an OSRM-equipped Databricks cluster.
Note that this sample code requires a cluster in dedicated access mode, since it uses sparkContext.broadcast to share the street network graph with the executors instead of downloading the graph file on each one.
def compute_routes_multi_node(trips_df, graph_path):
    import pandas as pd
    import osmnx as ox
    from pyspark.sql.functions import col
    from pyspark.sql.types import (
        StructType, StructField, LongType, DoubleType, IntegerType, TimestampType
    )

    # Output schema: one row per point along each reconstructed route
    route_schema = StructType([
        StructField("trip_id", LongType()),
        StructField("tpep_pickup_datetime", TimestampType()),
        StructField("tpep_dropoff_datetime", TimestampType()),
        StructField("path_latitude", DoubleType()),
        StructField("path_longitude", DoubleType()),
        StructField("order", IntegerType())
    ])

    # Load the street network once on the driver and broadcast it so every
    # executor reuses the same in-memory graph instead of re-reading the file
    graph = ox.load_graphml(graph_path)
    broadcast_graph = trips_df.sparkSession.sparkContext.broadcast(graph)

    def route_processor(pdf: pd.DataFrame) -> pd.DataFrame:
        import osmnx as ox
        import networkx as nx

        graph = broadcast_graph.value
        results = []
        for _, row in pdf.iterrows():
            try:
                # Snap pickup and drop-off coordinates to the nearest graph nodes
                orig_node = ox.nearest_nodes(graph, row['pickup_longitude'], row['pickup_latitude'])
                dest_node = ox.nearest_nodes(graph, row['dropoff_longitude'], row['dropoff_latitude'])

                # Shortest path by road length between the two nodes
                route = nx.shortest_path(graph, orig_node, dest_node, weight='length')
                lats, lons = zip(*[(graph.nodes[n]['y'], graph.nodes[n]['x']) for n in route])

                for i, (lat, lon) in enumerate(zip(lats, lons)):
                    results.append({
                        "trip_id": row["id"],
                        "tpep_pickup_datetime": row["tpep_pickup_datetime"],
                        "tpep_dropoff_datetime": row["tpep_dropoff_datetime"],
                        "path_latitude": lat,
                        "path_longitude": lon,
                        "order": i
                    })
            except Exception:
                # Skip trips that cannot be routed (e.g., points outside the graph)
                continue
        return pd.DataFrame(results)

    selected_df = trips_df.select(
        col("id"),
        col("tpep_pickup_datetime"),
        col("tpep_dropoff_datetime"),
        col("pickup_latitude"),
        col("pickup_longitude"),
        col("dropoff_latitude"),
        col("dropoff_longitude")
    ).dropna().distinct()

    # Route each trip group in parallel across the executors
    result_df = selected_df.groupBy("id").applyInPandas(route_processor, schema=route_schema)
    return result_df
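A sketch of how this function might be called is shown below; the place name, graph file location, and trips table (which needs trip id, timestamp, and pickup/drop-off coordinate columns) are assumptions for illustration.
import osmnx as ox

# Build a drivable street network once and save it to a shared location (hypothetical path)
graph = ox.graph_from_place("Manhattan, New York, USA", network_type="drive")
ox.save_graphml(graph, "/dbfs/tmp/manhattan_drive.graphml")

# Hypothetical trips table containing id, pickup/dropoff timestamps, and coordinate columns
trips_df = spark.table("main.mobility.yellow_taxi_trips")
routes_df = compute_routes_multi_node(trips_df, "/dbfs/tmp/manhattan_drive.graphml")
display(routes_df)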
[Route Generation Visualization in a Databricks Notebook]
Databricks simplifies geocoding by using a large language model (LLM) to convert unstructured text, such as ZIP codes, into structured geospatial data. With a natural language prompt, the ai_query function calls the databricks-meta-llama-3-3-70b-instruct endpoint to generate latitude and longitude without relying on external APIs.
While traditional geocoding tools are recommended when deterministic results are required, this example shows how easily LLMs can enhance geospatial workflows and democratize location intelligence.
trips = spark.table("samples.nyctaxi.trips")
prompt = "I am going to give you a zip code and you need to give me the latitude and longitude of that zip code. Do not say anything else, just return latitude and longitude separated by a comma. Example: 10110. Your response should be: 40.7589 ° N, 73.9851 ° W. Do not say anything other than returning latitude,longitude pickup_zip={}"
df_out = trips.selectExpr("*", f"ai_query('databricks-meta-llama-3-3-70b-instruct', CONCAT('{prompt}', pickup_zip), modelParameters => named_struct('max_tokens', 100, 'temperature', 0.7)) as pickup_lat_long")
Geospatial workloads demand flexible indexing to support varied query patterns. Databricks integrates H3 spatial indexing with Liquid Clustering to efficiently handle analytical queries and model training workflows. This combination enables fast filtering on spatial data combined with other attributes—like speed or social determinants—without requiring explicit Z-ordering.
The example below shows how to leverage built-in H3 support with Liquid Clustering. It uses ST_Centroid to compute geometry center points and ST_Transform to convert them to WGS84 coordinates. Then, h3_longlatash3 generates H3 indexes at resolution 9, enabling fast, consistent spatial queries across a hexagonal grid.
WITH computed AS (
  SELECT
    vol,
    ST_Transform(ST_Centroid(ST_GeomFromText(wktgeom, 2263)), 4326) AS geom,
    MAKE_TIMESTAMP(2000 + yr, m, d, hh, mm, 0) AS timestamp
  FROM {catalog_name}.{schema_name}.traffic_volume_counts_bronze
  WHERE wktgeom IS NOT NULL AND vol IS NOT NULL
)
MERGE INTO {catalog_name}.{schema_name}.traffic_volume_counts_silver AS target
USING (
  SELECT
    vol,
    ST_Y(geom) AS lat,
    ST_X(geom) AS long,
    timestamp,
    h3_longlatash3(ST_X(geom), ST_Y(geom), 9) AS h3_index
  FROM computed
) AS source
ON target.h3_index = source.h3_index AND target.timestamp = source.timestamp
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
The MERGE INTO operation enables idempotent upserts into silver Delta tables, preventing duplicates when processing the same data multiple times. When the table is defined with CLUSTER BY h3_index, records are colocated based on spatial proximity. Unlike static ZORDER, Liquid Clustering supports dynamic clustering on H3 indexes and fields like timestamps or vehicle metrics without requiring predefined query patterns. This results in faster lookups, efficient filtering, and scalable model training. For more details, refer to the Databricks H3 functions and Liquid Clustering documentation.
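For reference, the silver table in this example can be created with Liquid Clustering enabled along the following lines; the column types and the choice of clustering columns are assumptions you can adapt.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {catalog_name}.{schema_name}.traffic_volume_counts_silver (
        vol INT,
        lat DOUBLE,
        long DOUBLE,
        timestamp TIMESTAMP,
        h3_index BIGINT
    )
    CLUSTER BY (h3_index, timestamp)  -- Liquid Clustering on the H3 cell and event time
""")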
[Traffic Volume Visualization in Databricks Notebook Using KeplerGl library]
User-defined functions (UDFs) in Unity Catalog offer a secure, governed, and shareable way to perform deterministic geocoding at scale. Centralizing logic, such as converting ZIP codes into latitude and longitude, ensures that results remain consistent and auditable across users and workloads. The code below defines a Python-based UDF in Unity Catalog that securely returns the latitude and longitude for a given U.S. ZIP code using a public API.
CREATE OR REPLACE FUNCTION get_lat_long_from_zipcode(zipcode STRING)
RETURNS STRUCT<latitude: DOUBLE, longitude: DOUBLE>
LANGUAGE PYTHON
ENVIRONMENT (
  dependencies = '["requests"]',
  environment_version = "None"
)
AS $$
import requests

url = f"http://api.zippopotam.us/us/{zipcode}"
response = requests.get(url)
if response.status_code != 200:
    return None
data = response.json()
try:
    latitude = float(data['places'][0]['latitude'])
    longitude = float(data['places'][0]['longitude'])
except (KeyError, ValueError):
    return None
return (latitude, longitude)
$$;
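Once registered, the function is governed like any other Unity Catalog asset and can be called directly from SQL. A quick sanity check against the sample taxi data might look like the following; qualify the function name with your catalog and schema if needed.
# Call the UC-governed geocoding function on a small sample; pickup_zip is an integer, so cast it
result = spark.sql("""
    SELECT
        pickup_zip,
        get_lat_long_from_zipcode(CAST(pickup_zip AS STRING)) AS pickup_coords
    FROM samples.nyctaxi.trips
    LIMIT 10
""")
display(result)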
Understanding traffic patterns and risky driving behaviors is critical for smarter, safer mobility. With Databricks AutoML and spatial indexing, teams can build time-aware models without deep ML expertise.
The example below uses automl.forecast to train a time series model on traffic volume (vol) for a specific location (defined by h3_index). By focusing on a single H3 cell, the model captures temporal trends in that area. AutoML handles feature engineering, model tuning, and training—streamlining forecasting for use cases like congestion prediction and aggressive driving detection across zones.
import databricks.automl

table = f"{catalog}.{schema}.traffic_volume_counts_silver"
df = spark.table(table)

h3_cell = 613289796493766655  # replace with any H3 index of interest
df_filtered = df.filter(df.h3_index == h3_cell)

databricks.automl.forecast(
    dataset=df_filtered,
    target_col="vol",
    time_col="timestamp",
    experiment_name="traffic_volume_forecast",
    primary_metric="smape",  # symmetric mean absolute percentage error
    timeout_minutes=20
)
The same API also scales to many locations at once: passing the H3 index as an identity column trains one forecast per cell in a single run.
databricks.automl.forecast(
    dataset=df,
    time_col="timestamp",
    target_col="vol",
    identity_col=["h3_index"],  # one time series per H3 cell (multi-series forecasting)
    experiment_name="h3_forecast_multi_group",
    timeout_minutes=60
)
By pairing geospatial intelligence with AI and real-time processing, automotive organizations can unlock a new level of safety, efficiency, and innovation. From predictive maintenance to smart mobility and EV optimization, Databricks offers the unified platform needed to operationalize these use cases at scale. Customers are unlocking significant value today with our H3 geospatial functions, with much more planned on the product roadmap.
Ready to accelerate your geospatial journey? Explore our Geospatial Solution Accelerators and try them out in your own workspace today.