Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
PratikshaSaha
Databricks Employee

I think the base is the best bit of a cheesecake, always have and always will, and when I started looking at geospatial data in Databricks, I was eating a cheesecake. 

Government agencies dealing with geospatial data are stuck in a perfect storm of crucial data: flood zones, defences, historical records and river networks. Every iota of it is mission critical for planning decisions, insurance assessments and public safety. But in order to tame it, to funnel it into cogent pipelines, you need to tackle massive GeoPackage files totalling tens of gigabytes with millions of spatial features.

Once that data is in a SQL warehouse, you run a spatial query: a single point-in-polygon check against a national dataset. 19 hours later, you might have an answer, but by then you have probably forgotten why you asked in the first place.

Historically, without spatial indexes, a SQL warehouse would scan every single polygon to check if your point falls inside it. Multiply that across flood zones, defences, historic floods, and rivers and you may be watching that spinner churn for days. This is the reality of geospatial at a national scale on a platform that wasn't designed for it - like making a cheesecake in a chocolate factory.
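To make the cost concrete, here is a minimal sketch, in plain Python with toy data, of what an unindexed scan amounts to: every polygon gets a point-in-polygon test, so each query is linear in the size of the dataset. The ray-casting helper and the sample "zones" are illustrative, not the warehouse's actual implementation.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal
    ray from (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y-coordinate?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def naive_lookup(point, polygons):
    """Without a spatial index, every polygon must be tested -- O(n) per query."""
    return [pid for pid, poly in polygons.items()
            if point_in_polygon(point[0], point[1], poly)]

# Two toy "flood zones": a unit square and a far-away triangle.
zones = {
    "zone_a": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "zone_b": [(5, 5), (6, 5), (5.5, 6)],
}
print(naive_lookup((0.5, 0.5), zones))  # ['zone_a']
```

A spatial index such as GIST avoids exactly this: it prunes almost all polygons by bounding box before any exact geometry test runs.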

SQL warehouses are phenomenal at what they do: columnar analytics, massive joins, BI dashboards. And while Photon has made great strides with spatial join optimisations, when it comes to real-time serving of point-in-polygon queries against millions of complex geometries, a dedicated GIST index is what you need. Without the right equipment, the process takes a long time. Like whisking with a toothpick.

In the past, we obsessed about scaling Databricks’ capabilities, from ETL pipelines to handling streaming data to building AI agents. But now we are at that sweet pinnacle where Databricks can optimise and run almost every data and AI workload for every business use case. 

The question now becomes more nuanced - what combination of tools can we select, and layer on top of one another, to make the optimal solution? The tastiest cheesecake, if you will.

National organisations no longer want to simply query their geospatial data, they would like to interact with it. They would really like to do this in real time via an app. And they would love to do all this in a governed environment.

We decided to recipe test Lakebase to see if we could demo all this, and throw in one more surprise at the end…

Before sprucing up our architecture, it's important to understand how the cookie is traditionally crumbled. Typically, serving geospatial data in a real-time application required three separate systems. The Lakehouse would be used for batch processing the raw GeoPackage files in Spark and Delta, validating geometries and running spatial joins at scale. This, however, is too slow for real-time serving, with even simple queries taking anywhere from 500 ms to a couple of minutes with warehouse cold starts.

To handle this, another spatial database would be required outside Databricks, and you would write custom ETL to copy data from Delta to Postgres. This means separate governance, separate infrastructure and separate costs. 

And to top it all off you would need a GIS server, like MapServer for tile serving. 

It’s like making a cheesecake across 3 different kitchens, mixing the batter in one (Lakehouse for batch processing), baking in another (a separate Postgres instance for serving) and adding the topping in a third (a GIS server for tiles). You're constantly ferrying ingredients between rooms, each kitchen has its own set of keys, and if someone changes the recipe in one kitchen, the others don't know about it.

The result - an Eton Mess. 

With Databricks, we wanted to design one kitchen with everything you need. Lakebase is the KitchenAid. We wanted to build an interactive web application using the UK Environment Agency’s flood risk data for all of England, where users can view flood zones on a map, click any location and get an instant flood risk assessment, entirely within Databricks.

The core approach to this demo was: use every Databricks component for what it does best.

We used Spark SQL for batch ETL, parsing the GeoPackage files with ogr2ogr for fast native streaming conversion, with fiona (a Python wrapper) as a fallback. Spark then reads the GeoJSON lines as text, extracting properties and geometries and writing to Delta. The result was 3.9 million features across 5 Delta tables, all governed by Unity Catalog.
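Per record, the extraction step can be sketched in plain Python; the real job does this distributed across text lines in Spark, and the sample feature here is made up. Note the geometry stays a GeoJSON string, since Delta has no native geometry type.

```python
import json

def parse_feature_line(line: str) -> dict:
    """Extract properties and geometry from one GeoJSON Feature line --
    the same per-record work the Spark job performs at scale.
    Geometry is kept as a GeoJSON string for storage in Delta."""
    feature = json.loads(line)
    row = dict(feature.get("properties", {}))
    row["geometry"] = json.dumps(feature["geometry"])
    return row

# A made-up single-line Feature, as emitted by the ogr2ogr conversion.
line = ('{"type": "Feature", "properties": {"zone": 3}, '
        '"geometry": {"type": "Point", "coordinates": [-1.5, 53.8]}}')
row = parse_feature_line(line)
print(row["zone"], row["geometry"])
```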

[Figure: flood-risk-architecture.png]

Then we whipped up a Lakebase autoscaling instance and enabled PostGIS. Why autoscaling, you might ask? Well, most GIS applications have high usage during business hours and almost zero at 3 AM. Scale-to-Zero, a unique Lakebase capability, ensures that you aren't paying for a high-memory spatial database while your users are asleep. Geospatial query complexity can also vary significantly: one user may just be looking at a point on a map (low CPU), while another triggers a massive ST_Intersection (high CPU). Autoscaling can add Compute Units only when those spatial functions hit the engine. Tile serving patterns are also 'bursty', coming in intermittent floods of 50-100 tile requests when a user zooms or pans. Autoscaling can handle this smoothly, without you having to over-provision for the idle time in between zooms.

The one decision I had to make here was setting a minimum and maximum Compute Unit range for Lakebase to scale within. Each CU provides about 2 GB of RAM, and the range between min and max can't exceed 8 CU, keeping scaling responsive rather than erratic. For this app, I chose 2-8 CU, to give enough headroom for those bursty tile requests and heavy ST_Intersection queries, but not so much that it idles at the maximum when no one is looking at the map.
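As a rough sanity check on that choice, the arithmetic looks like this. The helper below is hypothetical, not a Lakebase API; the 2 GB-per-CU and 8-CU-span figures are as described above.

```python
RAM_PER_CU_GB = 2  # approximate RAM per Compute Unit, as noted above

def check_cu_range(min_cu: int, max_cu: int) -> dict:
    """Validate an autoscaling range and report the RAM it spans.
    Hypothetical helper for illustration only, not a Lakebase API."""
    if min_cu < 1 or max_cu < min_cu:
        raise ValueError("need 1 <= min_cu <= max_cu")
    if max_cu - min_cu > 8:
        raise ValueError("min-to-max span cannot exceed 8 CU")
    return {
        "min_ram_gb": min_cu * RAM_PER_CU_GB,
        "max_ram_gb": max_cu * RAM_PER_CU_GB,
    }

print(check_cu_range(2, 8))  # {'min_ram_gb': 4, 'max_ram_gb': 16}
```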

Now with the data in Delta, we synced the tables into Lakebase. A Lakebase sync is essentially pointing Lakebase at a Unity Catalog table; it continuously replicates the data into a Lakebase Postgres table, handling the schema mapping automatically. One thing to note: Delta doesn't have a native geometry type, so we stored geometries as GeoJSON strings and materialised them as PostGIS geometry columns in Lakebase after the sync, where we could then apply the GIST index. This GIST index is critical, it's the cream cheese, transforming Lakebase from just another Postgres into a real-time geospatial engine. We found spatial intersection queries against 615k polygons take around 4,100 ms without it. With it, this shrinks to 80 ms.
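The post-sync materialisation step can be sketched as a handful of SQL statements, built here in Python. Table and column names are illustrative, and SRID 27700 (British National Grid) is an assumption for the Environment Agency data, not something stated above.

```python
def materialise_geometry_sql(table: str, srid: int = 27700) -> list:
    """Post-sync DDL sketch: turn a synced GeoJSON string column into a
    PostGIS geometry column, then index it with GIST.
    Table/column names ('geometry', 'geom') are illustrative."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS geom geometry;",
        f"UPDATE {table} SET geom = ST_SetSRID(ST_GeomFromGeoJSON(geometry), {srid});",
        f"CREATE INDEX IF NOT EXISTS {table}_geom_gist ON {table} USING GIST (geom);",
    ]

for stmt in materialise_geometry_sql("flood_zones"):
    print(stmt)
```

It is the final `CREATE INDEX ... USING GIST` statement that delivers the 4,100 ms to 80 ms drop described above.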

Now Lakebase sync is glorious, but what if your data isn't in Delta? No fear, because we tested that too. For an Environmental Land Management app, we connected to Lakebase directly via psycopg, and used Python and SQL to read and insert the rows of land parcel data directly before applying the GIST index. For the flood risk app, I wanted it at a national scale so I relied on Spark SQL's distributed processing capabilities. Spark SQL is the oven here: sometimes you have a no-bake cheesecake and you can just leave it to set in the fridge. But other times you need real heat to help something come together, and that's when Spark SQL comes to help.
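A hedged sketch of that direct-load path might look like this. The connection itself is left commented out, and the table, columns and SRID are illustrative rather than the app's actual schema.

```python
import json

# Illustrative statements for the psycopg direct-load path (no Lakebase
# sync). Table name, columns and SRID 27700 are assumptions.
INSERT_SQL = (
    "INSERT INTO land_parcels (parcel_id, geom) "
    "VALUES (%s, ST_SetSRID(ST_GeomFromGeoJSON(%s), 27700))"
)
INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS land_parcels_geom_gist "
    "ON land_parcels USING GIST (geom);"
)

def rows_to_params(rows):
    """Shape (id, geojson_dict) rows into executemany() parameter tuples."""
    return [(pid, json.dumps(geom)) for pid, geom in rows]

# With a live Lakebase connection it would run roughly as:
# import psycopg
# with psycopg.connect(conninfo) as conn, conn.cursor() as cur:
#     cur.executemany(INSERT_SQL, rows_to_params(rows))
#     cur.execute(INDEX_SQL)

params = rows_to_params([("p1", {"type": "Point", "coordinates": [0, 51]})])
print(params[0][0])  # p1
```

Note the GIST index is created after the bulk insert, which is generally cheaper than maintaining it row by row during the load.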

Now that we have our Lakebase primed, it's time to serve it up. Databricks Apps allows you to deploy a full web application inside your workspace, with the same governance and the same auth. We built ours with FastAPI on the backend and Leaflet.js for the interactive map on the frontend, all running on Databricks compute. The app authenticates to Lakebase using OAuth; there are no separate credentials or API keys in environment variables. Rather than vector tiles, it serves simplified GeoJSON features per viewport, with each layer request using ST_AsGeoJSON with ST_SimplifyPreserveTopology to keep payloads light. When a user clicks a point on the map, the FastAPI backend fires PostGIS queries against Lakebase, checking flood zones, counting historic floods within 500 m, and returns an instant risk assessment.
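The per-click and per-viewport queries could be assembled roughly like this: parameterised PostGIS SQL built in Python, where the table names, columns and SRIDs are assumptions for illustration, not the app's actual schema.

```python
# Project the clicked WGS84 lon/lat into the data's planar SRID (assumed 27700)
# so distance checks are in metres. Placeholders use psycopg's named style,
# so values are never interpolated into the SQL text.
CLICK_POINT = "ST_Transform(ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326), 27700)"

QUERIES = {
    # Which flood zone contains the clicked point?
    "flood_zone": (
        f"SELECT zone FROM flood_zones "
        f"WHERE ST_Contains(geom, {CLICK_POINT}) LIMIT 1"
    ),
    # How many historic floods fall within the search radius (metres)?
    "historic_floods": (
        f"SELECT count(*) FROM historic_floods "
        f"WHERE ST_DWithin(geom, {CLICK_POINT}, %(radius)s)"
    ),
    # One viewport's worth of a layer, simplified to keep payloads light.
    "layer": (
        "SELECT ST_AsGeoJSON(ST_SimplifyPreserveTopology(geom, %(tolerance)s)) "
        "FROM flood_zones WHERE geom && "
        "ST_MakeEnvelope(%(xmin)s, %(ymin)s, %(xmax)s, %(ymax)s, 27700)"
    ),
}

print(QUERIES["historic_floods"])
```

The `&&` bounding-box operator in the layer query is what lets the GIST index prune geometries before the more expensive simplification runs.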

The cheesecake can be decorated however you like; we could have swapped Leaflet for Mapbox or added a Streamlit dashboard alongside the map. As the base is solid - Lakebase, PostGIS, GIST indexes - the toppings can be customised as you choose.

For the extra surprise, the one that gets us the Hollywood handshake, we embedded an AI planning adviser. When a user clicks a location and gets their flood risk assessment, they can now ask the question that actually matters - Can I build here?

That spatial context from PostGIS - which flood zone, how far from the nearest defence, historic flood details - all gets passed to Claude Sonnet via Databricks Foundation Model APIs. But we didn't want the agent to rely on its training data alone. We loaded the National Planning Policy Framework, Environment Agency standing advice and local authority flood guidance into Databricks Vector Search. Before responding, the agent retrieves the most relevant policy sections based on the spatial context, so if it tells you a Flood Risk Assessment is required in Zone 3, this is grounded in actual guidance, not simply paraphrased from memory.
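The grounding step can be sketched as simple prompt assembly: spatial facts from PostGIS first, retrieved policy extracts second, then the question. Field names and wording here are illustrative; the retrieval itself runs against Databricks Vector Search and the call goes to the Foundation Model APIs.

```python
def build_adviser_prompt(spatial_context: dict, policy_chunks: list) -> str:
    """Assemble the grounded prompt for the planning adviser.
    Illustrative sketch: field names and instruction wording are
    assumptions, not the app's exact prompt."""
    facts = "\n".join(f"- {k}: {v}" for k, v in spatial_context.items())
    policy = "\n\n".join(policy_chunks)
    return (
        "You are a flood-risk planning adviser. Answer using ONLY the "
        "policy extracts below, and give concrete next steps.\n\n"
        f"Site facts (from PostGIS):\n{facts}\n\n"
        f"Relevant policy extracts:\n{policy}\n\n"
        "Question: Can I build here?"
    )

prompt = build_adviser_prompt(
    {"flood_zone": 3, "historic_floods_within_500m": 4},
    ["Extract: development in Flood Zone 3 requires a Flood Risk Assessment."],
)
print(prompt)
```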

The agent interprets the data through the lens of policy, and gives the user a plain English answer with concrete next steps, rather than simply a risk score and a colour. The important thing here is that nothing about the base has changed. The Lakebase queries are the same, the GIST indexes are the same, the app is the same. We have just piped the output into an LLM grounded in real policy documents. One more layer on the cheesecake.

We set out to make a cheesecake, all in one kitchen. Spark SQL for the heavy mixing, Lakebase with PostGIS and GIST indexes for the base, Databricks Apps for the serving, Foundation Model APIs and Vector Search for insight. 

Way better than an Eton Mess. 

If you are working with geospatial data at scale, give Lakebase a try. Enable PostGIS, add a GIST index, and time the difference. Our GIST index alone turned 4-second queries into 80-millisecond ones, and the whole recipe replaced a 19-hour warehouse ordeal with something you can serve in real time.

Your cheesecake might turn out even better.
