
Introduction

If you’ve ever tried visualizing vast geospatial datasets—such as billions of points—you’ve likely run into performance bottlenecks or high costs.

At S&P Global Commodity Insights, we set out to solve a major challenge: efficiently rendering geospatial content while scaling performance and reducing costs.

Traditional approaches often rely on external geospatial databases or specialized ETL pipelines, forcing data copies and limiting flexibility.

In this blog post, we walk through our H3 indexing strategy built on a single source of truth. With over 1 billion records, we produce actionable heatmaps on demand to analyze spatio-temporal trends.

hubert_boguski_0-1757440931404.png

You’ll learn how to:

  • Reduce costs and complexity by eliminating the need for separate geospatial databases while maintaining one set of geospatial capable tables on Delta Lake.
  • Improve query performance by strategically filtering and clustering massive point datasets.
  • Aggregate data on demand using H3’s hierarchical hexagonal cells to scale from neighborhood views to global contexts.

“Leveraging query-time aggregations through H3 indexing enables customers to unlock powerful heat mapping capabilities for spatial analysis while delivering enhanced map performance. Given the large datasets offered at SPGCI, H3 indexing with query-time aggregations provides the scalability and responsiveness required to analyze vast amounts of data efficiently and effectively.” Andrea Anderson - S&P Executive Director of Product Management

To demonstrate our approach and allow others to repeat it, we chose a representative use case: finding the most efficient time to leave for the airport and maximizing profits as a taxi driver. The example uses the massive New York City Taxi Trips public dataset.

Access Commodity Insights datasets to make data-driven decisions here; also, a number of these, and other S&P Global datasets, are already available in the Databricks Marketplace.

 

Previous and Current Architecture

hubert_boguski_1-1757162146358.jpeg

When supporting spatial queries, we often move data into spatial databases, typically storing one geometry type per table so layers can be chosen and filtered separately on the map. This established process creates a separate geospatial-specific table for each geometry (Pickup Point, Drop-off Point, Trip Path) and replicates the data into these external systems.

However, this approach involves additional data movement by duplicating and transferring data into these external systems. This not only reduces data freshness, but also increases storage and compute costs.

hubert_boguski_2-1757162146361.jpeg

Databricks Spatial SQL addresses these problems by enabling performant geospatial queries directly on one geospatial source of truth with optional H3 indexing. By storing geospatial data in Delta Lake using Databricks' native spatial data types (in Public Preview as of DBR 17.1), we unlocked massive performance gains and delivered scalable, cost-efficient visualizations. The native types support easy conversion to and from interchange formats such as WKT (Well-Known Text) and WKB (Well-Known Binary), which the front end used for rendering.
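For example, here is a minimal sketch of that conversion at query time, assuming the taxi_trips table defined later in this post and that ST_AsBinary is available in your runtime's Spatial SQL preview:

-- Convert native geometries to interchange formats for the rendering layer.
-- taxi_trips and its geom_pickup column are created later in this post.
SELECT
  ST_AsText(geom_pickup) AS pickup_wkt,    -- human-readable WKT
  ST_AsBinary(geom_pickup) AS pickup_wkb,  -- compact WKB for the front end
  total_amount
FROM taxi_trips
LIMIT 10;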

Instead of moving data into external systems, we could now directly request different related map layers from a single, unified table on-the-fly.

hubert_boguski_3-1757162146368.jpeg

hubert_boguski_4-1757162146375.jpeg

hubert_boguski_5-1757162146387.jpeg

For example, the map views above returned the pickup points, drop-off points, or both from a single table.

hubert_boguski_6-1757162146402.jpeg

We could more easily uncover hidden patterns between related geospatial data from a single source.

This approach addressed redundant systems, minimized data movement, and allowed analysis on all columns. Given this data architecture and ease of use, we saw the potential to unlock significant improvements in query performance, cost savings, and the ability to deliver richer geospatial insights.

Note: While our production system uses OpenLayers for compatibility with our WMS services, we used kepler.gl during development; we also provide a Leaflet reference app with the linked assets at the end of this post. We could also have used various Databricks partners to meet our visualization requirements.

 

Iteration 1: Optimizing the Map Extent

With every pan and zoom, we queried only the relevant data in the map view. For instance, when a user zoomed into an area near our S&P Global office in New York City searching for pickups, we queried only the pickup points within the visible map extent.

hubert_boguski_7-1757162146409.jpeg

Filtering out points beyond the map extent provided significant further performance gains by leveraging Databricks optimizations. The bounding box of the map extent, defined by its minimum and maximum longitude/latitude values, was key to pruning files efficiently.
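As an illustrative sketch, the viewport bounds reported by the map client can be turned into the polygon WKT used throughout the queries below (the numeric bounds here match the example extent near our New York office):

-- Build a closed polygon ring from the viewport's min/max longitude/latitude.
SELECT format_string(
  'POLYGON ((%f %f, %f %f, %f %f, %f %f, %f %f))',
  -74.023905, 40.700032,  -- min_lon, min_lat
  -74.023905, 40.717794,  -- min_lon, max_lat
  -73.987169, 40.717794,  -- max_lon, max_lat
  -73.987169, 40.700032,  -- max_lon, min_lat
  -74.023905, 40.700032   -- repeat the first vertex to close the ring
) AS bounding_box_wkt;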

Leveraging Predictive Optimization to Enhance Query Performance

In Databricks, logical Delta Lake tables consist of many small Parquet data files along with a delta log. With Predictive Optimization running behind the scenes on a Delta Lake table, file-skipping statistics are collected dynamically. These statistics allow for skipping irrelevant data files that fall outside the bounding box of the user’s map view. This minimizes unnecessary file reads and improves query speed.
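If predictive optimization is not already inherited from the account, catalog, or schema, it can be enabled per table. A sketch, assuming the Unity Catalog managed taxi_trips table created in the next snippet already exists:

-- Opt the table into predictive optimization so OPTIMIZE and statistics
-- maintenance run automatically in the background (Unity Catalog managed
-- tables; many workspaces inherit this setting from the catalog or schema).
ALTER TABLE taxi_trips ENABLE PREDICTIVE OPTIMIZATION;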

Due to the high-cardinality nature of the pickup and dropoff columns, Liquid Clustering was applied to the Delta Lake tables, grouping similar spatial data together and improving query efficiency for random access patterns when filtering for map views. We can cluster by the geohash of the pickup and dropoff point geometries to improve spatial layout within our table.

CREATE OR REPLACE TABLE taxi_trips CLUSTER BY (geohash_pickup, geohash_dropoff) AS (
  SELECT
    ST_GeoHash(geom_pickup) as geohash_pickup,
    ST_GeoHash(geom_dropoff) as geohash_dropoff,
    *
  FROM (
    SELECT
      total_amount,
      ST_Point(pickup_longitude, pickup_latitude, 4326) as geom_pickup,
      ST_Point(dropoff_longitude, dropoff_latitude, 4326) as geom_dropoff
      -- ... additional fields
    FROM
      yellow_trip
  )
);

Spatial Filtering Pickups And/Or Drop-offs Using Map Extent

As the user interacts with our map interface, we can filter the data using the bounding box of the current viewport. Since Databricks native geometries store geometry bounds, this is a very straightforward query.

CREATE OR REPLACE TEMPORARY VIEW sp_point_v AS (
  WITH map_extent AS (
  SELECT
    ST_GeomFromWKT('POLYGON ((-74.023905 40.700032, -74.023905 40.717794, -73.987169 40.717794, -73.987169 40.700032, -74.023905 40.700032))', 4326) AS sp_office
  )
  SELECT
    pickup_longitude,
    pickup_latitude --,
    -- Optionally query additional points/attributes
    -- dropoff_longitude,
    -- dropoff_latitude,
    -- trip_cost,
    -- ...
  FROM
    taxi_trips,
    map_extent
  WHERE ST_Intersects(sp_office, geom_pickup)
);

The use of ST_Intersects filters to only those geometries that intersect our viewport. Again, we can use the coordinates directly alongside any native geometries for rendering purposes.

Combining File-Skipping Statistics, Liquid Clustering, and Predictive Optimization significantly improved query performance. These optimizations, powered by Databricks SQL Serverless Warehouse, reduced compute workloads, minimized costs, and delivered performant queries.

Why Optimized Filtering for Points Outside the Map Extent Wasn’t Enough

While filtering queries by the extent of the viewport worked well for smaller map views, we wanted higher performance even as users zoomed out on the map. Most notably, our data is larger than what we can reasonably return directly for a large viewport without some form of simplification or aggregation:

  • More points were included within the map extent.
  • Points were likely to be farther from each other and therefore in different partitions, resulting in more files needing to be read.
  • Rendering more points on the client became computationally expensive and, at a certain point, infeasible.

To increase performance at large map extents, we explored a new approach.

hubert_boguski_8-1757162146423.jpeg

 

Iteration 2: Aggregating Data with H3 Indexing

We turned to the H3 spatial indexing system, a hierarchical grid system that divides the world into hexagons at multiple sizes.

Here is the S&P Global office in New York City (point) indexed at resolutions 8, 10, 12 (hexagonal cells):

hubert_boguski_9-1757162269962.jpeg
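A minimal sketch of producing cells like these with Databricks' H3 functions (the office coordinates below are approximate and for illustration only):

-- Index an approximate office location at three H3 resolutions and return
-- each cell's boundary as WKT for plotting. Coordinates are illustrative.
SELECT
  res,
  h3_longlatash3(-74.009, 40.703, res) AS h3_cell,
  h3_boundaryaswkt(h3_longlatash3(-74.009, 40.703, res)) AS cell_boundary_wkt
FROM VALUES (8), (10), (12) AS t(res);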

When comparing point-based queries to H3-indexed queries on our data, we scanned the same amount of data when zoomed out, yet decreased the total time to see data on the map by over 70%, averaging one-second queries when panning or zooming.

Why H3 Indexing Was the Solution

  1. Reducing Records Returned: Points are grouped into hexagonal cells, significantly reducing the number of rows returned in the result set; when zoomed out, the map represents 1.2 billion points as hexagon aggregates.
  2. Scalability: H3’s hierarchical structure allows us to query hexagons at coarser resolutions when zoomed out and finer resolutions when zoomed in, ensuring performant queries at any zoom level in the map.
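The hierarchy is what makes this cheap: every fine cell rolls up deterministically to exactly one coarser parent. A small sketch of that relationship (the point is arbitrary):

-- A resolution-12 cell has exactly one resolution-8 ancestor, which is what
-- lets coarser map views aggregate the finer index without re-reading points.
SELECT
  h3_longlatash3(-74.009, 40.703, 12) AS cell_res12,
  h3_toparent(h3_longlatash3(-74.009, 40.703, 12), 8) AS parent_res8,
  h3_ischildof(
    h3_longlatash3(-74.009, 40.703, 12),
    h3_longlatash3(-74.009, 40.703, 8)
  ) AS is_child_of_res8;  -- true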

We indexed each pickup and drop-off point into H3 cells at resolutions 8, 10, and 12, building on the data engineering already accomplished in our initial taxi_trips table. Smaller cells (higher resolutions) like 12 were used for detailed zoom levels, while larger cells (lower resolutions) like 8 were used for broader views. By liquid clustering on the highest resolution, 12, we ensured increased performance for queries with random access patterns.

CREATE OR REPLACE TABLE h3_taxi_trips CLUSTER BY (pickup_cell_12, dropoff_cell_12, geohash_pickup, geohash_dropoff) AS (
  SELECT
    h3_longlatash3(pickup_longitude, pickup_latitude, 12) as pickup_cell_12,
    h3_longlatash3(pickup_longitude, pickup_latitude, 10) as pickup_cell_10,
    h3_longlatash3(pickup_longitude, pickup_latitude, 8) as pickup_cell_8,
    h3_longlatash3(dropoff_longitude, dropoff_latitude, 12) as dropoff_cell_12,
    h3_longlatash3(dropoff_longitude, dropoff_latitude, 10) as dropoff_cell_10,
    h3_longlatash3(dropoff_longitude, dropoff_latitude, 8) as dropoff_cell_8,
    *
  FROM taxi_trips
);

Covering Map Extent and Returning Aggregated Matches

To display hexagons within the visible screen, we first covered the user’s map extent with H3 cells at a suitable resolution.

Then, we joined these cells to the indexed data to return only the matching hexagons and their aggregated metrics.

hubert_boguski_10-1757162269967.jpeg

hubert_boguski_12-1757162269993.png

Databricks SQL Serverless Warehouse delivered these aggregations performantly at query time, grouping by any column with any aggregation function the user requests, such as the average trip cost or total revenue in each hexagon.

CREATE OR REPLACE TEMPORARY VIEW pickup_cell_12_v AS (
  WITH map_extent AS (
  SELECT
    'POLYGON ((-73.885202 40.767737, -73.885202 40.780541, -73.854818 40.780541, -73.854818 40.767737, -73.885202 40.767737))' AS bounding_box
  ),
  cells_in_view AS (
    SELECT
      EXPLODE(h3_coverash3(bounding_box, 12)) AS h3_cell
    FROM
      map_extent
  ),
  aggregated_data AS (
    SELECT
      pickup_cell_12,
      AVG(travel_time_minutes) as avg_travel_time_minutes,
      COUNT(*) AS amount_of_pickups,
      SUM(total_amount) AS total_revenue
    FROM
      h3_taxi_trips
    GROUP BY
      pickup_cell_12
  )
  SELECT
    a.pickup_cell_12,
    a.avg_travel_time_minutes,
    a.amount_of_pickups,
    a.total_revenue
  FROM
    aggregated_data a
  JOIN
    cells_in_view v
  ON
    a.pickup_cell_12 = v.h3_cell
);

For example, around the S&P office, we can see that the yellow hexagon zones have longer average taxi trip times, so we zoom in to find out more.

hubert_boguski_14-1757162270025.png

Dropoff Cell 8 Layer

hubert_boguski_16-1757162270039.jpeg

Dropoff Cell 10 Layer

hubert_boguski_18-1757162270048.jpeg

Dropoff Cell 12 Layer

hubert_boguski_20-1757162270194.png

Dropoff Points Layer

We see that taxi trips can have longer trip times depending on where the pickup and dropoff occur.

By adjusting the H3 resolution based on the zoom level, we efficiently controlled the granularity of the aggregated data returned and thus maintained responsive, interactive maps. After zooming in to the finest zoom level, we could see the individual taxi drop-offs (and/or pickups) using the ST_Intersects filter from Iteration 1, which again was efficient at smaller map extents.

By pairing zoom levels to corresponding resolutions, querying at resolutions 8, 10, and 12 returned a roughly constant number of cells to the map. This approach significantly reduced query time, minimized data rendered on the client, and ensured seamless performance across all zoom levels.
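The pairing itself can live in a small SQL UDF that the application calls before issuing the query; the zoom breakpoints below are illustrative assumptions, not the exact values we shipped:

-- Map a client-supplied zoom level to one of the pre-indexed H3 resolutions.
-- The breakpoints are illustrative; tune them to your map tiling scheme.
CREATE OR REPLACE FUNCTION zoom_to_h3_resolution(zoom INT)
RETURNS INT
RETURN CASE
  WHEN zoom >= 16 THEN 12  -- street-level views: finest indexed cells
  WHEN zoom >= 13 THEN 10  -- neighborhood views
  ELSE 8                   -- city-wide and broader views
END;

-- Example: SELECT zoom_to_h3_resolution(14);  -- returns 10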

Domain-Specific Map Analytics

With a single source of truth, this approach allows analytical workflows to run at query time and drive actionable decisions.

Different personas can use specific knowledge to query and discover insights on demand:

  • Rush to Flight: Filters can be created to answer specific questions or model specific patterns, such as optimal source (pickup) to destination (drop-off) locations. As an S&P Global employee in the NYC office, I frequently have international flights departing from JFK to Poland. To ensure I arrive at the airport with enough time for my flight, I want to find the shortest travel time given my usual flight schedule (for a given day, month, or year). With both pickup and drop-off points in the same table, I can analyze previous pickups around the S&P NYC office and drop-offs around JFK International Airport that have the fastest travel times and target the best zones for pickups (see the sketch after this list).
  • Taxi Drivers: Distance trends between H3 cells can be spotted. As a taxi driver, I want to maximize my daily earnings. The less time I spend searching for customers, the more income I usually make. Therefore, I need to know which taxi zones have frequent drop-offs followed by nearby pickups. H3 excels at fast distance calculations, allowing me to quickly identify these zones. With all the data indexed in a single table, I can narrow my search to drop-off zones near pickup zones with above-average trip costs and high tips, maximizing both my time spent transporting customers and my earnings.
  • Oil and Gas Analysts: Spatio-temporal patterns can be analyzed. As an oil and gas analyst, I want to understand production trends to make informed decisions. I need to visualize and analyze data from multiple wells to identify which wells are producing the most over time. With query-time aggregations, I can adjust the date while filtering by any attribute to maximize profits.
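As a sketch of the "Rush to Flight" pattern above: the two polygons, the pickup_datetime column name, and travel_time_minutes are assumptions for illustration, not exact values from our schema.

-- Hypothetical "Rush to Flight" query: historical trips starting near the
-- S&P NYC office and ending near JFK, grouped by pickup hour to find the
-- departure window with the shortest average travel time.
WITH areas AS (
  SELECT
    ST_GeomFromWKT('POLYGON ((-74.02 40.70, -74.02 40.72, -73.99 40.72, -73.99 40.70, -74.02 40.70))', 4326) AS office_area,
    ST_GeomFromWKT('POLYGON ((-73.83 40.63, -73.83 40.67, -73.75 40.67, -73.75 40.63, -73.83 40.63))', 4326) AS jfk_area
)
SELECT
  HOUR(pickup_datetime) AS pickup_hour,
  AVG(travel_time_minutes) AS avg_travel_time_minutes,
  COUNT(*) AS trips
FROM taxi_trips, areas
WHERE ST_Intersects(office_area, geom_pickup)
  AND ST_Intersects(jfk_area, geom_dropoff)
GROUP BY HOUR(pickup_datetime)
ORDER BY avg_travel_time_minutes;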

 

Iteration 3: Optimizing Storage with Dynamic H3 Resolutions

While indexing at multiple resolutions improved query performance, it can increase data storage volume, especially when leveraging linear and areal spatial data. To help in those situations, we explored a query-time resolution approach using H3’s hierarchical functions.

How Dynamic Resolutions Work

Instead of precomputing multiple H3 resolutions, we could have indexed points only at the finest resolution (12) and dynamically calculated coarser parent cells at query time using h3_toparent (refer to the h3_taxi_trips table, which includes the resolution 12 index). We can calculate aggregations at coarser resolutions by moving "up" to a larger cell, e.g. resolution 9, or any other resolution below 12, including resolutions that were never indexed in previous iterations.

CREATE OR REPLACE TEMPORARY VIEW pickup_parent_cell_9_v AS (
  WITH map_extent AS (
  SELECT
    ST_GeomFromWKT('POLYGON ((-74.050598 40.699381, -74.050598 40.793279, -73.848724 40.793279, -73.848724 40.699381, -74.050598 40.699381))', 4326) AS bounding_box
  ),
  cells_in_view AS (
    SELECT
      EXPLODE(h3_coverash3(st_astext(bounding_box), 9)) AS h3_cell
    FROM
      map_extent
  ),
  parent_cells AS (
    SELECT
      h3_toparent(pickup_cell_12, 9) AS pickup_cell_9, -- Note Resolution 9 was never indexed, but we can calculate by moving "up" H3 hierarchy
      total_amount
    FROM
      h3_taxi_trips
  ),
  aggregated_data AS (
    SELECT
      pickup_cell_9,
      COUNT(*) AS amount_of_pickups,
      AVG(total_amount) AS avg_trip_cost
    FROM
      parent_cells
    GROUP BY
      pickup_cell_9
  )
  SELECT
    a.pickup_cell_9,
    a.amount_of_pickups,
    a.avg_trip_cost
  FROM
    aggregated_data a
  JOIN
    cells_in_view v
  ON
    a.pickup_cell_9 = v.h3_cell
);

Now we can easily (and flexibly) filter pickups between January 1, 2001 and September 13, 2021 (spanning the entire dataset), aggregate by year and month, and order by the highest average trip cost to return the cells with the highest trip costs over time. See the included notebook series for additional details and tips.
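A sketch of that temporal roll-up, assuming pickup_datetime is the trip's pickup timestamp column:

-- Roll resolution-12 pickups up to resolution-9 parents and aggregate by
-- year and month within the date window, surfacing the priciest cells first.
SELECT
  h3_toparent(pickup_cell_12, 9) AS pickup_cell_9,
  YEAR(pickup_datetime) AS pickup_year,
  MONTH(pickup_datetime) AS pickup_month,
  AVG(total_amount) AS avg_trip_cost,
  COUNT(*) AS amount_of_pickups
FROM h3_taxi_trips
WHERE pickup_datetime BETWEEN '2001-01-01' AND '2021-09-13'
GROUP BY h3_toparent(pickup_cell_12, 9), YEAR(pickup_datetime), MONTH(pickup_datetime)
ORDER BY avg_trip_cost DESC;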

 

Conclusion

This solution delivers speed, scalability, and simplicity. By adopting a single-source-of-truth architecture, we addressed the cost and complexity of duplicative data systems. Our approach not only streamlined data operations and enhanced analysis; it also cut query times significantly anywhere on the map.

Large-scale queries at broader zoom levels saw map result times reduced by up to 70%, and detailed zoom levels maintained this performance.

With Databricks Spatial SQL and H3 indexing we achieved:

  • Unified Architecture: Querying directly on Delta tables with open data formats removed the friction of multiple external tables, enabling consistent and fresh data access.
  • Performance at Scale: H3 indexing empowered on-demand aggregations, yielding instant heatmaps and spatio-temporal insights on 1.2 billion taxi pickups.
  • Cost Efficiency: By leveraging predictive optimization and liquid clustering with dynamic resolution queries, we eliminated unnecessary file reads and minimized operational overhead—leading to both faster queries and lower compute costs.

The integration of Databricks Spatial SQL and spatial indexing has opened doors for our use cases. From pinpointing the busiest NYC taxi zones to analyzing oil and gas production trends, this solution delivers interactive geospatial analytics that scale.

Explore all the code as well as an interactive reference app in this GitHub repo.