Databricks Community

dannywong · ‎09-19-2024

This is the first part of a two-part series blog on geospatial data processing on Databricks. The first part will cover ingesting and processing Overture Maps data on Databricks, while the second part will delve into a practical use case on dynamic segmentation.

Geospatial data is transforming how we understand and interact with our world, but processing this data efficiently at scale remains a significant challenge.

Overture Maps is a rich, open source geospatial dataset that promises to revolutionise mapping and location-based services. However, the sheer volume and complexity of this data can overwhelm traditional processing methods. In this blog, we'll explore a practical approach to tackling this challenge with Databricks, focusing on filtering Overture map data for Victoria, Australia. Along the way, we'll uncover techniques for parameterising notebooks, automating workflows, and benchmarking performance across different cluster sizes.

Overview of Overture Maps Data

The Overture Maps Foundation has adopted the GeoParquet format for publishing its geospatial data. While it's possible to access specific data subsets using tools like Apache Sedona on Spark, in this blog we focus on downloading the entire dataset and applying our own filters and transformations. This approach offers several advantages:

Data Sovereignty and Control: By ingesting the entire dataset, you gain complete control over your data. This approach ensures that you're not dependent on external services or potential changes in data accessibility.
Performance Optimization: Having the data within your own environment allows for fine-tuned performance optimisations. You can structure the data in ways that best suit your specific use cases (e.g. liquid clustering), potentially leading to faster query times and more efficient processing.
Customisation Flexibility: With the full dataset at your disposal, you have the freedom to create custom data models, apply your own transformations, or combine Overture data with other proprietary datasets seamlessly.

Setting Up Databricks for Geospatial Data Ingestion

Databricks is a data intelligence platform that provides robust analytics and machine learning for spatial and aspatial data. For geospatial data ingestion and processing, Databricks integrates seamlessly with various geospatial libraries and tools (e.g., Sedona, Databricks Mosaic, Geopandas, GDAL etc). Recently Databricks has announced Spatial SQL, a native geospatial capability designed to enhance spatial data handling. As of this writing, Spatial SQL is in Private Preview for DBR 14.3+. Users interested in exploring this upcoming functionality should reach out to their Databricks account team to inquire about participating in the preview program.

Downloading the data

There are various ways to download the Overture maps data. We'll use azcopy to copy the data to a UC volume using a Databricks Notebook.

Firstly, install azcopy on the Databricks cluster if you haven’t done so:

%sh
sudo bash -c 'cd /usr/local/bin; curl -L https://aka.ms/downloadazcopy-v10-linux | tar --strip-components=1 --exclude=*.txt -xzvf -; chmod +x azcopy'

This is my catalogue structure. I created a UC volume, danny_catalog.overture.raw, to store the geoparquet files. You can create yours with the UI or using SQL.

Create a folder in the raw UC volume, then copy the data using azcopy to that folder:

%sh
mkdir -p /Volumes/danny_catalog/overture/raw/2024-08-20.0

azcopy copy "https://overturemapswestus2.dfs.core.windows.net/release/2024-08-20.0/" "/Volumes/danny_catalog/overture/raw/"  --recursive

Copying 431 GB of data (August 2024 release) from the Overture storage account to the UC volume took approximately 23 minutes.

However, it's important to note that transfer times can vary significantly based on several factors:

Geographic proximity: The time depends on the region of your Databricks workspace relative to the Overture data source.
Dataset size: Larger datasets will naturally take longer to transfer.
Network conditions: Transfer speeds can be affected by current network traffic and bandwidth.
Cluster resources: The specifications of your Databricks cluster may impact transfer speeds.

Create the filtering pipeline

To automate the filtering process, we will leverage Databricks workflows and parameterised notebooks. This pipeline will run on a monthly schedule, ensuring that the filtered data remains up-to-date. The core of this pipeline involves filtering the Overture Maps data based on a multi-polygon boundary that closely approximates the Victoria region, with adjustments to include areas near but slightly outside the official boundary as those are also the areas of interest. This filtered data will then be persisted as a Delta table, making it readily available for downstream systems and processes to consume. The filtering logic utilises Spatial SQL's powerful ST_ functions to efficiently process the geospatial data. By automating this process, we ensure consistent and reliable data updates, which are crucial for our subsequent analyses and applications.

Environmental setup

Let’s create the pipeline notebook. Firstly we will create the notebook widgets:

dbutils.widgets.text("catalog", "") # Catalog name
dbutils.widgets.text("schema", "") # Schema name
dbutils.widgets.text("table", "") # The filtered geo data by theme and type
dbutils.widgets.text("volume", "") # The volume that holds the raw files
dbutils.widgets.text("map_theme", "") # Overture map theme
dbutils.widgets.text("map_type", "") # Overture map type
dbutils.widgets.text("release", "") # Overture release 
dbutils.widgets.text("aus_polygon", "") # The geometry for the filter

You can provide default values to the notebook widgets such as:

dbutils.widgets.text("catalog", "danny_catalog")

This is how it shows up on the notebook UI. Populate the value that you desire for the widget and it will be referenced later:

My aus_polygon value is:

MULTIPOLYGON(((146.224235224402 -35.4363358189844,148.225501405747 -35.647622056311,148.96662090591 -36.4718804131401,150.580844867015 -37.4059734691392,150.387767097845 -37.9947686179651,148.333245272028 -38.1583694792249,147.167982775491 -38.8351013798231,146.39794836319 -39.6562618160424,144.470264564547 -38.8532552527778,143.457314156555 -39.3875969932018,141.63332727031 -38.8880005741606,140.233033758295 -38.5024700559333,140.387650481347 -34.1564967083719,140.537933347279 -33.5832702977468,142.035376107123 -33.6588797210494,142.579114407923 -33.9833605220553,143.478747807493 -34.4440846825862,144.45098970567 -35.2900425539108,144.961566823313 -35.4071384438589,146.224235224402 -35.4363358189844)))

It is a multi-polygon that covers the State of Victoria in Australia with some buffers at the border. If you want to follow along, you can use this value or the geometry relevant to your use case.

Get the values from the notebook widget and store them as variables:

mapTheme = getArgument("map_theme")
mapType = getArgument("map_type")
catalog = getArgument("catalog")
schema = getArgument("schema")
table = getArgument("table")
volume = getArgument("volume")
release = getArgument("release")
aus_polygon = getArgument("aus_polygon")

Processing

Read the geoparquet file for the map theme and map type based on the input from the notebook widget. Using the st_contains function from the Spatial SQL to filter data and only keeps those in the Victoria polygon.

from pyspark.sql import functions as F

df = spark.read.parquet(f"/Volumes/{catalog}/{schema}/{volume}/{release}/theme={mapTheme}/type={mapType}/")

df_australia = df.filter(F.expr(f"st_contains(st_geomfromwkt('{aus_polygon}'), st_geomfromwkb(geometry))"))

df_australia.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table}")

Automate with Databricks workflow

Click “Create Job” on the Workflow page.

Click “Edit parameters” under “Job parameters”. Enter the value of the job level parameters. These values will be passed to your notebook when your job runs.

Add the task-level parameters.

This is the Overture map data structure, we are creating all the tasks based on this structure.

Given the high degree of similarity among the tasks, switching to YAML code mode can streamline the task creation process. This allows you to copy, paste, and edit the tasks more efficiently, leveraging their similarities. Alternatively you can use the for each task capability that was recently released.

Now, we have successfully created a workflow with 14 tasks that run in parallel on the same job clusters.

Cluster Size	Time	Databricks Cost for this job
Single node	137 minutes	$1.71
1 driver + 2 workers	69 minutes	$2.59
1 driver + 4 workers	38 minutes	$2.38
1 driver + 8 workers	21 minutes	$2.36
1 driver + 16 workers	16 minutes	$3.40

VM type for driver and workers: Standard_D4ds_v5

You can process the world’s geo data for less than the price of an ice cream scoop! Databricks provides you with the flexibility to select the appropriate cluster size to align with the specific time constraints and budgetary considerations. And you can now start working on your Overture data!

Sample query to test it out:

WITH H3_BUILDING AS (
SELECT
  id,
  level,
  height,
  names.primary AS primary_name,
  st_astext(ST_GeomFromWkb(geometry)) AS geometry, 
  inline(h3_tessellateaswkb(geometry, 10))
FROM danny_catalog.overture.overture_buildings_building
WHERE names.primary IS NOT NULL
)
SELECT * except(geometry) FROM H3_BUILDING

It converts the geometry data of these buildings into a text format and applies a tessellation function to the geometry using the H3 geospatial indexing system. You can now do it with Spatial SQL on Databricks, without installing any additional libraries, with either a notebook or an SQL editor.

CREATE TABLE danny_catalog.overture.overture_buildings_building_h3 AS (
  WITH H3_BUILDING AS (
    SELECT
      id,
      level,
      height,
      names.primary AS primary_name,
      st_astext(ST_GeomFromWkb(geometry)) AS geometry, 
      inline(h3_tessellateaswkb(geometry, 10))
    FROM danny_catalog.overture.overture_buildings_building
    WHERE names.primary IS NOT NULL
  )
  SELECT * EXCEPT(geometry) 
  FROM H3_BUILDING 
  CLUSTER BY (cellid)
)

This SQL code creates a new table that is clustered using Databricks’ liquid clustering feature, specifically on these H3 cell IDs. By clustering data based on H3 indexes, the system ensures that spatially related information is stored together, greatly enhancing the speed and efficiency of spatial queries and analytics. This optimization is particularly valuable for big data applications, allowing for flexible, automatic, and efficient processing of massive geospatial datasets without the need for complex manual partitioning strategies.

Conclusion

In this blog, we've navigated the exciting journey of transforming Overture Maps data using Databricks, showcasing how to efficiently process and refine geospatial data at scale. By leveraging Databricks' powerful capabilities, including parameterised notebooks and automated workflows, we've streamlined the process of filtering and analysing complex datasets.

Ready to elevate your geospatial analytics? Dive into Databricks' geospatial capabilities and experience the power of Spatial SQL. Start your journey today and see how Databricks can uncover the stories behind your geospatial data!

In our next blog, we'll explore geospatial analytics in more depth, focusing on dynamic segmentation using Apache Sedona on Databricks. Don't miss the opportunity to see how these advanced techniques can further enhance your geospatial data analysis.

Databricks Community

From Raw to Refined: Processing Overture Maps Geospatial Data on Databricks - Part 1

Overview of Overture Maps Data

Setting Up Databricks for Geospatial Data Ingestion

Downloading the data

Create the filtering pipeline

Environmental setup

Processing

Automate with Databricks workflow

Conclusion

Metadata-Driven ETL Framework in Databricks (Part-1)

Best practices for safe data experimentation with Databricks

Top 10 query performance tuning tips for Databricks Serverless SQL