If you’ve ever tried visualizing vast geospatial datasets—such as billions of points—you’ve likely run into performance bottlenecks or high costs.
At S&P Global Commodity Insights, we set out to solve a major challenge: efficiently rendering geospatial content while scaling performance and reducing costs.
Traditional approaches often rely on external geospatial databases or specialized ETL pipelines, forcing data copies and limiting flexibility.
In this blog post, we explore our H3 indexing strategy built on a single source of truth. With over 1 billion records, we produce on-demand, actionable heatmaps to analyze spatio-temporal trends.
You’ll learn how to do the same with your own data.
“Leveraging query-time aggregations through H3 indexing enables customers to unlock powerful heat mapping capabilities for spatial analysis while delivering enhanced map performance. Given the large datasets offered at SPGCI, H3 indexing with query-time aggregations provides the scalability and responsiveness required to analyze vast amounts of data efficiently and effectively.” Andrea Anderson - S&P Executive Director of Product Management
To demonstrate our approach, and to let others repeat it, we chose a representative use case: finding the most efficient time to leave for the airport and maximizing profits as a taxi driver. The example uses the massive New York City Taxi Trips public dataset.
You can access Commodity Insights datasets to make data-driven decisions here; a number of these, along with other S&P Global datasets, are already available in the Databricks Marketplace.
To support spatial queries, we often move data into dedicated spatial databases, typically storing one geometry per table so that layers can be chosen and filtered separately on the map. This established process creates a separate geospatial-specific table for each geometry (Pickup Point, Drop-off Point, Path) and replicates the data into these separate systems.
However, this approach involves additional data movement by duplicating and transferring data into external systems, which not only reduces data freshness but also increases storage and compute costs.
Databricks Spatial SQL addresses these problems by enabling performant geospatial queries directly on one geospatial source of truth, with optional H3 indexing. By storing geospatial data in Delta Lake using Databricks' native spatial data types (in Public Preview as of DBR 17.1), we unlocked massive performance gains and delivered scalable, cost-efficient visualizations. The native types convert easily to and from interchange formats such as WKT (Well-Known Text) and WKB (Well-Known Binary), which our front end used for rendering.
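As a quick sketch of that conversion (using the yellow_trip source table referenced later; st_asbinary is assumed here as the WKB counterpart to st_astext):
-- Sketch: native GEOMETRY values alongside interchange formats for the front end.
SELECT
  ST_Point(pickup_longitude, pickup_latitude, 4326) AS geom_pickup,             -- native geometry
  ST_AsText(ST_Point(pickup_longitude, pickup_latitude, 4326)) AS wkt_pickup,   -- WKT string
  ST_AsBinary(ST_Point(pickup_longitude, pickup_latitude, 4326)) AS wkb_pickup  -- assumed WKB helper
FROM yellow_trip;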
Instead of moving data into external systems, we could now directly request different related map layers from a single, unified table on-the-fly.
For example, the map views above returned the pickup points, drop-off points, or both from a single table.
We could more easily uncover hidden patterns between related geospatial data from a single source.
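For illustration, here is a minimal sketch of serving either layer from the same table (taxi_trips is created in the next section; :layer is an assumed application parameter):
-- Sketch: one table, multiple map layers chosen at request time.
SELECT
  CASE WHEN :layer = 'pickup' THEN ST_AsText(geom_pickup)
       ELSE ST_AsText(geom_dropoff)
  END AS layer_wkt
FROM taxi_trips;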
This approach addressed redundant systems, minimized data movement, and allowed analysis on all columns. Given this data architecture and ease of use, we saw the potential to unlock significant improvements in query performance, cost savings, and the ability to deliver richer geospatial insights.
Note: While our production system uses OpenLayers for compatibility with our WMS services, we used kepler.gl during development; we also provide a Leaflet reference app with the linked assets at the end of this post. We could also have used various Databricks partners to help us meet our visualization requirements.
With every pan and zoom, we queried only the relevant data in the map view. For instance, when a user zoomed into an area near our S&P Global office in New York City searching for pickups, we queried only the pickup points within the visible map extent.
Filtering out points beyond the map extent provided significant performance gains by leveraging Databricks optimizations. The bounding box of the map extent, defined by its minimum and maximum longitude/latitude values, was key to pruning files efficiently.
In Databricks, logical Delta Lake tables consist of many small Parquet data files along with a delta log. With Predictive Optimization running behind the scenes on a Delta Lake table, file-skipping statistics are collected dynamically. These statistics allow for skipping irrelevant data files that fall outside the bounding box of the user’s map view. This minimizes unnecessary file reads and improves query speed.
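As a simplified sketch of the idea (using the same illustrative bounding-box coordinates as the viewport query later in this post), a filter on the raw coordinate columns lines up directly with the collected min/max statistics:
-- Sketch: viewport pruning against min/max file statistics on the coordinate columns.
SELECT *
FROM taxi_trips
WHERE pickup_longitude BETWEEN -74.023905 AND -73.987169
  AND pickup_latitude  BETWEEN  40.700032 AND  40.717794;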
Due to the high-cardinality nature of the pickup and dropoff columns, Liquid Clustering was applied to the Delta Lake tables, grouping similar spatial data together and improving query efficiency for random access patterns when filtering for map views. We can cluster by the geohash of the pickup and dropoff point geometries to improve spatial layout within our table.
CREATE OR REPLACE TABLE taxi_trips CLUSTER BY (geohash_pickup, geohash_dropoff) AS (
SELECT
ST_GeoHash(geom_pickup) as geohash_pickup,
ST_GeoHash(geom_dropoff) as geohash_dropoff,
*
FROM (
SELECT
total_amount,
ST_Point(pickup_longitude, pickup_latitude, 4326) as geom_pickup,
ST_Point(dropoff_longitude, dropoff_latitude, 4326) as geom_dropoff
-- ... additional fields
FROM
yellow_trip
)
);
As the user interacts with our map interface, we can filter the data using the bounding box of the current viewport. Since Databricks native geometries store geometry bounds, this is a very straightforward query.
CREATE OR REPLACE TEMPORARY VIEW sp_point_v AS (
WITH map_extent AS (
SELECT
ST_GeomFromWKT('POLYGON ((-74.023905 40.700032, -74.023905 40.717794, -73.987169 40.717794, -73.987169 40.700032, -74.023905 40.700032))', 4326) AS sp_office
)
SELECT
pickup_longitude,
pickup_latitude
-- Optionally query additional points/attributes:
-- , dropoff_longitude
-- , dropoff_latitude
-- , trip_cost
-- ...
FROM
taxi_trips,
map_extent
WHERE ST_Intersects(sp_office, geom_pickup)
);
ST_Intersects filters the results to only those geometries that intersect our viewport. Again, the raw coordinates can be returned directly, alongside any native geometries, for rendering purposes.
Combining File-Skipping Statistics, Liquid Clustering, and Predictive Optimization significantly improved query performance. These optimizations, powered by Databricks SQL Serverless Warehouse, reduced compute workloads, minimized costs, and delivered performant queries.
While filtering queries by the viewport extent worked well for smaller map views, we wanted high performance even as users zoomed out on the map. Most notably, our data is larger than what we can reasonably return directly for a large viewport without some form of simplification or aggregation:
To increase performance at large map extents, we explored a new approach: the H3 spatial indexing system, a hierarchical grid system that divides the world into hexagonal cells at multiple resolutions.
Here is the S&P Global office in New York City (point) indexed at resolutions 8, 10, 12 (hexagonal cells):
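A small sketch of indexing a single point at those resolutions (the coordinates below are illustrative, not the exact office location):
-- Sketch: one point indexed at H3 resolutions 8, 10, and 12 (illustrative Lower Manhattan coordinates).
SELECT
  h3_longlatash3(-74.0134, 40.7069, 8)  AS cell_8,
  h3_longlatash3(-74.0134, 40.7069, 10) AS cell_10,
  h3_longlatash3(-74.0134, 40.7069, 12) AS cell_12,
  h3_boundaryaswkt(h3_longlatash3(-74.0134, 40.7069, 12)) AS cell_12_boundary_wkt  -- hexagon outline for rendering
;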
When comparing point-based queries to H3-indexed queries on our data at the same zoomed-out extent, we searched the same amount of data yet decreased the total time to see data on the map by over 70%, averaging one-second queries when panning or zooming.
We indexed each pickup and drop-off point into H3 cells at resolutions 8, 10, and 12, building on the data engineering already accomplished in our initial taxi_trips table. Smaller cells (higher resolutions) like 12 were used for detailed zoom levels, while larger cells (lower resolutions) like 8 were used for broader views. By liquid clustering on the highest resolution, 12, we ensured increased performance for queries with random access patterns.
CREATE OR REPLACE TABLE h3_taxi_trips CLUSTER BY (pickup_cell_12, dropoff_cell_12, geohash_pickup, geohash_dropoff) AS (
SELECT
h3_longlatash3(pickup_longitude, pickup_latitude, 12) as pickup_cell_12,
h3_longlatash3(pickup_longitude, pickup_latitude, 10) as pickup_cell_10,
h3_longlatash3(pickup_longitude, pickup_latitude, 8) as pickup_cell_8,
h3_longlatash3(dropoff_longitude, dropoff_latitude, 12) as dropoff_cell_12,
h3_longlatash3(dropoff_longitude, dropoff_latitude, 10) as dropoff_cell_10,
h3_longlatash3(dropoff_longitude, dropoff_latitude, 8) as dropoff_cell_8,
*
FROM taxi_trips
);
To display hexagons within the visible screen, we first covered the user’s map extent with H3 cells at a suitable resolution.
Then, we joined these cells to the indexed data to return only the matching hexagons and their aggregated metrics.
Databricks Serverless SQL Warehouse delivered these aggregations at query time, aggregating on any column with any supported aggregation function the user requests, such as the average trip cost or total revenue in each hexagon.
CREATE OR REPLACE TEMPORARY VIEW pickup_cell_12_v AS (
WITH map_extent AS (
SELECT
'POLYGON ((-73.885202 40.767737, -73.885202 40.780541, -73.854818 40.780541, -73.854818 40.767737, -73.885202 40.767737))' AS bounding_box
),
cells_in_view AS (
SELECT
EXPLODE(h3_coverash3(bounding_box, 12)) AS h3_cell
FROM
map_extent
),
aggregated_data AS (
SELECT
pickup_cell_12,
AVG(travel_time_minutes) as avg_travel_time_minutes,
COUNT(*) AS amount_of_pickups,
SUM(total_amount) AS total_revenue
FROM
h3_taxi_trips
GROUP BY
pickup_cell_12
)
SELECT
a.pickup_cell_12,
a.avg_travel_time_minutes,
a.amount_of_pickups,
a.total_revenue
FROM
aggregated_data a
JOIN
cells_in_view v
ON
a.pickup_cell_12 = v.h3_cell
);
For example, around the S&P office, we can see that the yellow hexagon zones have longer average taxi trip times, so we zoom in to find out more.
Map layers shown: Dropoff Cell 8, Dropoff Cell 10, Dropoff Cell 12, and Dropoff Points.
We see that taxi trips can have longer trip times depending on where the pickup and dropoff occur.
By adjusting the H3 resolution based on the zoom level, we efficiently controlled the granularity of the aggregated data returned and thus maintained responsive, interactive maps. After zooming in to the final zoom level, we could see the individual taxi dropoffs (and/or pickups) using the ST_Intersects viewport filter from Iteration 1, which again was efficient at smaller map extents.
By pairing zoom levels with corresponding resolutions, queries at resolutions 8, 10, and 12 returned a roughly constant number of cells to the map. This approach significantly reduced query time, minimized data rendered on the client, and ensured seamless performance across all zoom levels.
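As a sketch, the zoom-to-resolution mapping can live in the query layer itself (the breakpoints and the :zoom parameter below are illustrative assumptions; tune them to your tiling scheme):
-- Sketch: pick an H3 resolution from the client's zoom level (illustrative breakpoints).
SELECT CASE
         WHEN :zoom >= 16 THEN 12  -- street-level views
         WHEN :zoom >= 13 THEN 10  -- neighborhood views
         ELSE 8                    -- city-wide views
       END AS h3_resolution;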
With a single source of truth, the approach now allows analytical workflows to run at query time, turning insights into actionable decisions.
Different personas can apply their own domain knowledge to query and discover insights on demand.
While indexing at multiple resolutions improved query performance, it can increase storage volume, especially when indexing linear and areal spatial data. To help in those situations, we explored a query-time resolution approach using H3’s hierarchical functions.
Instead of precomputing multiple H3 resolutions, we could have indexed points only at the finest resolution (12) and dynamically calculated coarser parent cells at query time using h3_toparent (refer to the h3_taxi_trips table, which already includes the resolution-12 index). We can compute approximate aggregations by moving “up” to a larger cell, e.g. resolution 9, or any other resolution below 12, including ones that were never indexed in previous iterations.
CREATE OR REPLACE TEMPORARY VIEW pickup_parent_cell_9_v AS (
WITH map_extent AS (
SELECT
ST_GeomFromWKT('POLYGON ((-74.050598 40.699381, -74.050598 40.793279, -73.848724 40.793279, -73.848724 40.699381, -74.050598 40.699381))', 4326) AS bounding_box
),
cells_in_view AS (
SELECT
EXPLODE(h3_coverash3(st_astext(bounding_box), 9)) AS h3_cell
FROM
map_extent
),
parent_cells AS (
SELECT
h3_toparent(pickup_cell_12, 9) AS pickup_cell_9, -- Note Resolution 9 was never indexed, but we can calculate by moving "up" H3 hierarchy
total_amount
FROM
h3_taxi_trips
),
aggregated_data AS (
SELECT
pickup_cell_9,
COUNT(*) AS amount_of_pickups,
AVG(total_amount) AS avg_trip_cost
FROM
parent_cells
GROUP BY
pickup_cell_9
)
SELECT
a.pickup_cell_9,
a.amount_of_pickups,
a.avg_trip_cost
FROM
aggregated_data a
JOIN
cells_in_view v
ON
a.pickup_cell_9 = v.h3_cell
);
Now we can easily and flexibly filter to pickups between January 1, 2001 and September 13, 2021 (spanning the entire dataset), aggregate by year and month, and order by average trip cost to return the cells with the highest trip costs over time. See the included notebook series for additional details and tips.
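A sketch of that query, assuming a pickup timestamp column named pickup_datetime (adjust to your schema):
-- Sketch: cells with the highest average trip cost over time, rolled up to resolution 9.
SELECT
  h3_toparent(pickup_cell_12, 9) AS pickup_cell_9,
  year(pickup_datetime)          AS trip_year,
  month(pickup_datetime)         AS trip_month,
  AVG(total_amount)              AS avg_trip_cost
FROM h3_taxi_trips
WHERE pickup_datetime BETWEEN '2001-01-01' AND '2021-09-13'  -- spans the entire dataset
GROUP BY ALL
ORDER BY avg_trip_cost DESC;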
This solution delivers speed, scalability, and simplicity. By adopting a single-source-of-truth architecture, we addressed the cost and complexity of duplicative data systems. Our approach not only streamlined data operations and enhanced analysis, it also cut query times significantly anywhere on the map.
Large-scale queries at broader zoom levels saw map result times reduced by up to 70%, and detailed zoom levels maintained this performance.
With Databricks Spatial SQL and H3 indexing, we achieved all of the above from a single source of truth.
The integration of Databricks Spatial SQL and spatial indexing has opened doors for our use cases. From pinpointing the busiest NYC taxi zones, to analyzing gas and oil production trends, this solution delivers interactive geospatial analytics that scale.
Explore all the code as well as an interactive reference app in this GitHub repo.