This is the first part of a two-part series blog on geospatial data processing on Databricks. The first part will cover ingesting and processing Overture Maps data on Databricks, while the second part will delve into a practical use case on dynamic segmentation.
Geospatial data is transforming how we understand and interact with our world, but processing this data efficiently at scale remains a significant challenge.
Overture Maps is a rich, open source geospatial dataset that promises to revolutionise mapping and location-based services. However, the sheer volume and complexity of this data can overwhelm traditional processing methods. In this blog, we'll explore a practical approach to tackling this challenge with Databricks, focusing on filtering Overture map data for Victoria, Australia. Along the way, we'll uncover techniques for parameterising notebooks, automating workflows, and benchmarking performance across different cluster sizes.
The Overture Maps Foundation has adopted the GeoParquet format for publishing its geospatial data. While it's possible to access specific data subsets using tools like Apache Sedona on Spark, in this blog we focus on downloading the entire dataset and applying our own filters and transformations. This approach offers several advantages:
Databricks is a data intelligence platform that provides robust analytics and machine learning for spatial and aspatial data. For geospatial data ingestion and processing, Databricks integrates seamlessly with various geospatial libraries and tools (e.g., Sedona, Databricks Mosaic, Geopandas, GDAL etc). Recently Databricks has announced Spatial SQL, a native geospatial capability designed to enhance spatial data handling. As of this writing, Spatial SQL is in Private Preview for DBR 14.3+. Users interested in exploring this upcoming functionality should reach out to their Databricks account team to inquire about participating in the preview program.
There are various ways to download the Overture maps data. We'll use azcopy to copy the data to a UC volume using a Databricks Notebook.
Firstly, install azcopy on the Databricks cluster if you haven’t done so:
|
This is my catalogue structure. I created a UC volume, danny_catalog.overture.raw, to store the geoparquet files. You can create yours with the UI or using SQL.
Create a folder in the raw UC volume, then copy the data using azcopy to that folder:
|
Copying 431 GB of data (August 2024 release) from the Overture storage account to the UC volume took approximately 23 minutes.
However, it's important to note that transfer times can vary significantly based on several factors:
To automate the filtering process, we will leverage Databricks workflows and parameterised notebooks. This pipeline will run on a monthly schedule, ensuring that the filtered data remains up-to-date. The core of this pipeline involves filtering the Overture Maps data based on a multi-polygon boundary that closely approximates the Victoria region, with adjustments to include areas near but slightly outside the official boundary as those are also the areas of interest. This filtered data will then be persisted as a Delta table, making it readily available for downstream systems and processes to consume. The filtering logic utilises Spatial SQL's powerful ST_ functions to efficiently process the geospatial data. By automating this process, we ensure consistent and reliable data updates, which are crucial for our subsequent analyses and applications.
Let’s create the pipeline notebook. Firstly we will create the notebook widgets:
|
You can provide default values to the notebook widgets such as:
|
This is how it shows up on the notebook UI. Populate the value that you desire for the widget and it will be referenced later:
My aus_polygon value is:
|
It is a multi-polygon that covers the State of Victoria in Australia with some buffers at the border. If you want to follow along, you can use this value or the geometry relevant to your use case.
Get the values from the notebook widget and store them as variables:
|
Read the geoparquet file for the map theme and map type based on the input from the notebook widget. Using the st_contains function from the Spatial SQL to filter data and only keeps those in the Victoria polygon.
|
Click “Create Job” on the Workflow page.
Click “Edit parameters” under “Job parameters”. Enter the value of the job level parameters. These values will be passed to your notebook when your job runs.
Add the task-level parameters.
This is the Overture map data structure, we are creating all the tasks based on this structure.
Given the high degree of similarity among the tasks, switching to YAML code mode can streamline the task creation process. This allows you to copy, paste, and edit the tasks more efficiently, leveraging their similarities. Alternatively you can use the for each task capability that was recently released.
Now, we have successfully created a workflow with 14 tasks that run in parallel on the same job clusters.
Cluster Size |
Time |
Databricks Cost for this job |
Single node |
137 minutes |
$1.71 |
1 driver + 2 workers |
69 minutes |
$2.59 |
1 driver + 4 workers |
38 minutes |
$2.38 |
1 driver + 8 workers |
21 minutes |
$2.36 |
1 driver + 16 workers |
16 minutes |
$3.40 |
VM type for driver and workers: Standard_D4ds_v5
You can process the world’s geo data for less than the price of an ice cream scoop! Databricks provides you with the flexibility to select the appropriate cluster size to align with the specific time constraints and budgetary considerations. And you can now start working on your Overture data!
Sample query to test it out:
|
It converts the geometry data of these buildings into a text format and applies a tessellation function to the geometry using the H3 geospatial indexing system. You can now do it with Spatial SQL on Databricks, without installing any additional libraries, with either a notebook or an SQL editor.
|
This SQL code creates a new table that is clustered using Databricks’ liquid clustering feature, specifically on these H3 cell IDs. By clustering data based on H3 indexes, the system ensures that spatially related information is stored together, greatly enhancing the speed and efficiency of spatial queries and analytics. This optimization is particularly valuable for big data applications, allowing for flexible, automatic, and efficient processing of massive geospatial datasets without the need for complex manual partitioning strategies.
In this blog, we've navigated the exciting journey of transforming Overture Maps data using Databricks, showcasing how to efficiently process and refine geospatial data at scale. By leveraging Databricks' powerful capabilities, including parameterised notebooks and automated workflows, we've streamlined the process of filtering and analysing complex datasets.
Ready to elevate your geospatial analytics? Dive into Databricks' geospatial capabilities and experience the power of Spatial SQL. Start your journey today and see how Databricks can uncover the stories behind your geospatial data!
In our next blog, we'll explore geospatial analytics in more depth, focusing on dynamic segmentation using Apache Sedona on Databricks. Don't miss the opportunity to see how these advanced techniques can further enhance your geospatial data analysis.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.