dannywong
Databricks Employee

This is the first part of a two-part blog series on geospatial data processing on Databricks. This first part covers ingesting and processing Overture Maps data on Databricks, while the second part will delve into a practical use case: dynamic segmentation.

Geospatial data is transforming how we understand and interact with our world, but processing this data efficiently at scale remains a significant challenge.  

Overture Maps is a rich, open-source geospatial dataset that promises to revolutionise mapping and location-based services. However, the sheer volume and complexity of the data can overwhelm traditional processing methods. In this blog, we'll explore a practical approach to tackling this challenge with Databricks, focusing on filtering Overture Maps data for Victoria, Australia. Along the way, we'll uncover techniques for parameterising notebooks, automating workflows, and benchmarking performance across different cluster sizes.

Overview of Overture Maps Data

The Overture Maps Foundation has adopted the GeoParquet format for publishing its geospatial data. While it's possible to access specific data subsets using tools like Apache Sedona on Spark, in this blog we focus on downloading the entire dataset and applying our own filters and transformations. This approach offers several advantages:

  • Data Sovereignty and Control: By ingesting the entire dataset, you gain complete control over your data. This approach ensures that you're not dependent on external services or potential changes in data accessibility.
  • Performance Optimization: Having the data within your own environment allows for fine-tuned performance optimisations. You can structure the data in ways that best suit your specific use cases (e.g. liquid clustering), potentially leading to faster query times and more efficient processing.
  • Customisation Flexibility: With the full dataset at your disposal, you have the freedom to create custom data models, apply your own transformations, or combine Overture data with other proprietary datasets seamlessly.

 

Setting Up Databricks for Geospatial Data Ingestion

Databricks is a data intelligence platform that provides robust analytics and machine learning for spatial and non-spatial data. For geospatial data ingestion and processing, Databricks integrates seamlessly with various geospatial libraries and tools (e.g., Apache Sedona, Databricks Mosaic, GeoPandas, GDAL). Recently, Databricks announced Spatial SQL, a native geospatial capability designed to enhance spatial data handling. As of this writing, Spatial SQL is in Private Preview for DBR 14.3+. Users interested in exploring this upcoming functionality should reach out to their Databricks account team to inquire about participating in the preview program.

 

Downloading the data

There are various ways to download the Overture Maps data. We'll use azcopy to copy the data to a Unity Catalog (UC) volume using a Databricks notebook.

First, install azcopy on the Databricks cluster if you haven't done so already:

%sh
# Download the azcopy v10 Linux tarball and install the binary into /usr/local/bin
sudo bash -c 'cd /usr/local/bin; curl -L https://aka.ms/downloadazcopy-v10-linux | tar --strip-components=1 --exclude=*.txt -xzvf -; chmod +x azcopy'

This is my catalog structure. I created a UC volume, danny_catalog.overture.raw, to store the GeoParquet files. You can create yours with the UI or using SQL.

 

[Screenshot: Unity Catalog structure showing the danny_catalog.overture.raw volume]
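If you prefer to script this rather than click through the UI, a minimal sketch is below; it reuses the names from this example, so swap in your own catalog, schema, and volume names.

# Create the catalog, schema and managed volume used in this example.
# The names mirror the ones above; replace them with your own.
spark.sql("CREATE CATALOG IF NOT EXISTS danny_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS danny_catalog.overture")
spark.sql("CREATE VOLUME IF NOT EXISTS danny_catalog.overture.raw")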

Create a folder in the raw UC volume, then copy the data using azcopy to that folder:

%sh
# Create the release folder inside the raw volume, then recursively copy the
# 2024-08-20.0 release from the Overture storage account into the volume
mkdir -p /Volumes/danny_catalog/overture/raw/2024-08-20.0

azcopy copy "https://overturemapswestus2.dfs.core.windows.net/release/2024-08-20.0/" "/Volumes/danny_catalog/overture/raw/" --recursive

Copying 431 GB of data (August 2024 release) from the Overture storage account to the UC volume took approximately 23 minutes. 

[Screenshot: azcopy transfer summary]
However, it's important to note that transfer times can vary significantly based on several factors:

  • Geographic proximity: The time depends on the region of your Databricks workspace relative to the Overture data source.
  • Dataset size: Larger datasets will naturally take longer to transfer.
  • Network conditions: Transfer speeds can be affected by current network traffic and bandwidth.
  • Cluster resources: The specifications of your Databricks cluster may impact transfer speeds.

Create the filtering pipeline

To automate the filtering process, we will leverage Databricks Workflows and parameterised notebooks. The pipeline will run on a monthly schedule, ensuring that the filtered data remains up to date. The core of the pipeline is filtering the Overture Maps data against a multi-polygon boundary that closely approximates the Victoria region, adjusted to include areas near but slightly outside the official boundary, since those areas are also of interest. The filtered data is then persisted as a Delta table, making it readily available for downstream systems and processes to consume. The filtering logic uses Spatial SQL's ST_ functions to efficiently process the geospatial data. Automating this process ensures consistent and reliable data updates, which are crucial for our subsequent analyses and applications.

Environment setup

Let's create the pipeline notebook. First, create the notebook widgets:

dbutils.widgets.text("catalog", "") # Catalog name
dbutils.widgets.text("schema", "") # Schema name
dbutils.widgets.text("table", "") # The filtered geo data by theme and type
dbutils.widgets.text("volume", "") # The volume that holds the raw files
dbutils.widgets.text("map_theme", "") # Overture map theme
dbutils.widgets.text("map_type", "") # Overture map type
dbutils.widgets.text("release", "") # Overture release 
dbutils.widgets.text("aus_polygon", "") # The geometry for the filter

You can provide default values to the notebook widgets such as:

dbutils.widgets.text("catalog", "danny_catalog")

This is how it shows up in the notebook UI. Populate each widget with the value you want; the values will be referenced later:

[Screenshot: notebook widgets in the notebook UI]

My aus_polygon value is:

MULTIPOLYGON(((146.224235224402 -35.4363358189844,148.225501405747 -35.647622056311,148.96662090591 -36.4718804131401,150.580844867015 -37.4059734691392,150.387767097845 -37.9947686179651,148.333245272028 -38.1583694792249,147.167982775491 -38.8351013798231,146.39794836319 -39.6562618160424,144.470264564547 -38.8532552527778,143.457314156555 -39.3875969932018,141.63332727031 -38.8880005741606,140.233033758295 -38.5024700559333,140.387650481347 -34.1564967083719,140.537933347279 -33.5832702977468,142.035376107123 -33.6588797210494,142.579114407923 -33.9833605220553,143.478747807493 -34.4440846825862,144.45098970567 -35.2900425539108,144.961566823313 -35.4071384438589,146.224235224402 -35.4363358189844)))

It is a multi-polygon that covers the State of Victoria in Australia with some buffers at the border. If you want to follow along, you can use this value or the geometry relevant to your use case.
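As a quick, optional sanity check, you can round-trip the WKT through the same Spatial SQL functions the pipeline uses (st_geomfromwkt and st_astext). This is just a sketch and assumes the Spatial SQL preview is enabled on your cluster:

# Round-trip the widget value through Spatial SQL to confirm the WKT parses.
# A malformed polygon should fail (or return NULL) here rather than mid-pipeline.
aus_polygon = getArgument("aus_polygon")
spark.sql(f"SELECT st_astext(st_geomfromwkt('{aus_polygon}')) AS parsed_polygon").show()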

Get the values from the notebook widgets and store them as variables:

mapTheme = getArgument("map_theme")
mapType = getArgument("map_type")
catalog = getArgument("catalog")
schema = getArgument("schema")
table = getArgument("table")
volume = getArgument("volume")
release = getArgument("release")
aus_polygon = getArgument("aus_polygon")

Processing

Read the GeoParquet files for the map theme and type specified by the notebook widgets, then use the st_contains function from Spatial SQL to filter the data, keeping only the features that fall inside the Victoria polygon.

from pyspark.sql import functions as F

# Read the GeoParquet files for the selected theme and type from the raw volume
df = spark.read.parquet(f"/Volumes/{catalog}/{schema}/{volume}/{release}/theme={mapTheme}/type={mapType}/")

# Keep only features whose geometry lies inside the Victoria multi-polygon
df_australia = df.filter(F.expr(f"st_contains(st_geomfromwkt('{aus_polygon}'), st_geomfromwkb(geometry))"))

# Persist the filtered data as a Delta table for downstream consumption
df_australia.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.{table}")

[Screenshot: notebook output after writing the filtered table]

 

Automate with Databricks Workflows

Click "Create Job" on the Workflows page.

[Screenshot: Create Job on the Workflows page]

 

Click "Edit parameters" under "Job parameters" and enter the values of the job-level parameters. These values will be passed to your notebook when your job runs.

[Screenshot: job-level parameters]

Add the task-level parameters. 

 

[Screenshot: task-level parameters]

This is the Overture Maps data structure; we are creating one task per theme/type combination based on it.

[Screenshot: Overture Maps data structure (themes and types)]

Given the high degree of similarity among the tasks, switching to the YAML code mode can streamline task creation: you can copy, paste, and edit the task definitions more efficiently. Alternatively, you can use the recently released "For each" task capability.
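Another option, if you prefer to drive everything from a single controller notebook instead of defining 14 separate tasks, is to loop over the theme/type combinations and call the parameterised notebook with dbutils.notebook.run. The sketch below uses an illustrative notebook path and a partial theme/type list; align them with your workspace and the structure of the release you downloaded. Note that this runs the combinations sequentially, whereas the workflow tasks run in parallel.

# Illustrative controller loop: run the filtering notebook once per theme/type pair.
# The notebook path and the (theme, type) list are placeholders.
theme_types = [
    ("buildings", "building"),
    ("buildings", "building_part"),
    ("places", "place"),
    ("transportation", "segment"),
    ("transportation", "connector"),
    # ...add the remaining theme/type pairs from the release structure
]

common_params = {
    "catalog": "danny_catalog",
    "schema": "overture",
    "volume": "raw",
    "release": "2024-08-20.0",
    "aus_polygon": aus_polygon,  # the WKT multi-polygon defined earlier
}

for map_theme, map_type in theme_types:
    dbutils.notebook.run(
        "./overture_filter",  # hypothetical path to the parameterised notebook
        3600,                 # timeout in seconds
        {**common_params,
         "map_theme": map_theme,
         "map_type": map_type,
         "table": f"overture_{map_theme}_{map_type}"},
    )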

 

[Screenshot: workflow tasks in YAML code mode]
Now we have successfully created a workflow with 14 tasks that run in parallel on the same job cluster.

[Screenshot: workflow with 14 parallel tasks]

Cluster Size          | Time        | Databricks Cost for this job
Single node           | 137 minutes | $1.71
1 driver + 2 workers  | 69 minutes  | $2.59
1 driver + 4 workers  | 38 minutes  | $2.38
1 driver + 8 workers  | 21 minutes  | $2.36
1 driver + 16 workers | 16 minutes  | $3.40

VM type for driver and workers: Standard_D4ds_v5

You can process the world's geospatial data for less than the price of an ice cream scoop! Databricks gives you the flexibility to select a cluster size that aligns with your time constraints and budget. And you can now start working on your Overture data!

Sample query to test it out:

WITH H3_BUILDING AS (
SELECT
  id,
  level,
  height,
  names.primary AS primary_name,
  st_astext(ST_GeomFromWkb(geometry)) AS geometry, 
  inline(h3_tessellateaswkb(geometry, 10))
FROM danny_catalog.overture.overture_buildings_building
WHERE names.primary IS NOT NULL
)
SELECT * except(geometry) FROM H3_BUILDING

[Screenshot: sample query results]

The query converts the building geometries from WKB into WKT text and tessellates each geometry into H3 cells (resolution 10 here) using the H3 geospatial indexing system. You can now do this with Spatial SQL on Databricks, without installing any additional libraries, from either a notebook or the SQL editor.

CREATE TABLE danny_catalog.overture.overture_buildings_building_h3 AS (
  WITH H3_BUILDING AS (
    SELECT
      id,
      level,
      height,
      names.primary AS primary_name,
      st_astext(ST_GeomFromWkb(geometry)) AS geometry, 
      inline(h3_tessellateaswkb(geometry, 10))
    FROM danny_catalog.overture.overture_buildings_building
    WHERE names.primary IS NOT NULL
  )
  SELECT * EXCEPT(geometry) 
  FROM H3_BUILDING 
  CLUSTER BY (cellid)
)


This SQL code creates a new table that is clustered using Databricks' liquid clustering feature, specifically on these H3 cell IDs. By clustering data based on H3 indexes, the system ensures that spatially related information is stored together, greatly enhancing the speed and efficiency of spatial queries and analytics. This optimization is particularly valuable for big data applications, allowing for flexible, automatic, and efficient processing of massive geospatial datasets without the need for complex manual partitioning strategies.
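To see the clustering pay off, you can filter directly on cellid. The sketch below (the coordinates and resolution are illustrative) uses the built-in h3_longlatash3 function to find the resolution-10 cell containing a point in central Melbourne, then looks up buildings whose tessellation includes that cell:

# Look up the H3 cell (resolution 10) containing a point in central Melbourne,
# then query the clustered table by cellid so liquid clustering can prune files.
melbourne_cell = spark.sql(
    "SELECT h3_longlatash3(144.9631, -37.8136, 10) AS cell"
).first()["cell"]

spark.sql(f"""
    SELECT id, primary_name, height
    FROM danny_catalog.overture.overture_buildings_building_h3
    WHERE cellid = {melbourne_cell}
""").show()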

Conclusion

In this blog, we've navigated the exciting journey of transforming Overture Maps data using Databricks, showcasing how to efficiently process and refine geospatial data at scale. By leveraging Databricks' powerful capabilities, including parameterised notebooks and automated workflows, we've streamlined the process of filtering and analysing complex datasets.

Ready to elevate your geospatial analytics? Dive into Databricks' geospatial capabilities and experience the power of Spatial SQL.  Start your journey today and see how Databricks can uncover the stories behind your geospatial data!

In our next blog, we'll explore geospatial analytics in more depth, focusing on dynamic segmentation using Apache Sedona on Databricks. Don't miss the opportunity to see how these advanced techniques can further enhance your geospatial data analysis.

9 Comments
GC-James
Contributor II

Nice article. Couple of questions: 1. Curious about the H3 near the end. Why did you choose H3 resolution 10 for the tessellation? That seems very coarse for buildings. You end up with maybe 20 buildings per H3 index. I guess it depends what other data you want to join to your buildings and what resolution that is. 2. Would it not make sense to keep the WKT representation of the buildings, so that higher resolution spatial joins can be done as needed - rather than totally abstract to the H3 cell id? Thanks.

dannywong
Databricks Employee

Thank you for your questions!  You've raised some excellent points that deserve clarification:

  • The choice of H3 resolution 10 in the example was primarily for demonstration purposes.  The resolution choice should be tailored to the specific use case. For broad area analysis like finding buildings within a suburb, a coarser resolution might suffice. For more precise operations, a finer resolution would be appropriate.
  • H3 indexing provides an estimation that can be useful for quick spatial joins and proximity analysis. However, it's not intended to replace exact geometric operations.
  • A hybrid approach can be highly effective. Databricks supports combining H3 indexing with precise spatial functions like st_contains(). This allows for initial filtering using H3 followed by exact spatial operations where needed (see the sketch after this list).
  • Maintaining the original WKT/WKB representation alongside H3 indexes offers the best of both worlds. It allows for flexible querying strategies depending on the specific requirements of each analysis.
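
For illustration, a rough sketch of that hybrid pattern against the tables from the blog. The query polygon is a placeholder, and h3_coverash3 (a built-in Databricks H3 function) covers the area of interest at the same resolution used for tessellation:

# Coarse pass: H3 cells covering the area of interest (no false negatives, some false positives).
# Fine pass: join back to the source table and apply the exact st_contains predicate.
query_wkt = "POLYGON((144.9 -37.9, 145.0 -37.9, 145.0 -37.8, 144.9 -37.8, 144.9 -37.9))"  # placeholder

hits = spark.sql(f"""
    WITH query_cells AS (
        SELECT explode(h3_coverash3('{query_wkt}', 10)) AS cellid
    ),
    candidates AS (
        SELECT DISTINCT b.id
        FROM danny_catalog.overture.overture_buildings_building_h3 b
        JOIN query_cells q ON b.cellid = q.cellid
    )
    SELECT src.*
    FROM danny_catalog.overture.overture_buildings_building src
    JOIN candidates c ON src.id = c.id
    WHERE st_contains(st_geomfromwkt('{query_wkt}'), st_geomfromwkb(src.geometry))
""")
hits.show()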

Hope it clarifies!

yousry-mohamed
New Contributor II

Nice one Danny 😎
Quick question: in the cost table, the transition from 8 workers to 16 workers is not linear. Is there a reason behind this?

Will the same process work on serverless compute? The cost mentioned is probably for the DBUs only but there is still cost for cloud VMs. Hence it would be great to get an overall figure which could be simpler to calculate with serverless compute.

dannywong
Databricks Employee

Thank you, Yousry. I apologize for the error in my previous calculations regarding single node, two worker nodes, and four worker nodes. I have made the necessary modifications, and the numbers should reflect a linear progression soon after the page is updated.


I intentionally excluded the cloud VM costs from the calculation due to their variability. These costs depend on several factors, including the cloud region of deployment, applicable discounts, and whether you're using a pay-as-you-go or commitment plan. Additionally, pricing differs across various cloud providers. To simplify the comparison, I focused solely on the Databricks costs.


For a rough estimate, you can assume that $1 in DBU (Databricks Units) will typically incur between $1 and $1.5 in underlying VM costs.

mitchstares
New Contributor II

Great Article. Nice to see more use-cases for geospatial on databricks and leveraging the new spatial sql preview. 

You mentioned running the pipeline on a monthly schedule to keep the filtered data up-to-date. Presumably running the azcopy component each time. Are you automating the overture release parameter, or just manually running the job and updating the parameter with the new release?

 

dannywong
Databricks Employee

It is manual currently, as the release dates are different every month:
https://docs.overturemaps.org/release/latest/

We can potentially automate it by programmatically checking this page and getting the latest release version; if it is a new one, we can pass it as a parameter and trigger a run.

Thanks!

GC-James
Contributor II

@mitchelh -- I'm currently building an automated pipeline which I will run weekly to download Overture data. To know if there has been a new release there is this JSON file you can check: http://labs.overturemaps.org/data/releases.json 

{
    "latest": "2024-09-18.0",
    "releases": [
        "2024-09-18.0",
        "2024-08-20.0",
        "2024-07-22.0",
        "2024-06-13-beta.1",
        "2024-06-13-beta.0",
        "2024-05-16-beta.0",
        "2024-04-16-beta.0",
        "2024-03-12-alpha.0",
        "2024-02-15-alpha.0",
        "2024-01-17-alpha.0",
        "2023-12-14-alpha.0",
        "2023-11-14-alpha.0",
        "2023-10-19-alpha.0",
        "2023-07-26-alpha.0",
        "2023-04-02-alpha"
    ]
}

I compare the 'latest' field with whatever version we have in our Lake, and then download the new one if they don't match.
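
Roughly, the check can be as simple as the sketch below; how you look up the release you currently hold (a control table, or the folder names in the raw volume) is up to you, so the value here is a placeholder:

# Fetch the published release list and compare it with the release we already hold.
import json
import urllib.request

RELEASES_URL = "http://labs.overturemaps.org/data/releases.json"

with urllib.request.urlopen(RELEASES_URL) as resp:
    latest = json.load(resp)["latest"]

current_release = "2024-08-20.0"  # placeholder: read this from your own metadata

if latest != current_release:
    print(f"New release available: {latest} (currently on {current_release})")
    # ...trigger the download/filter job, passing `latest` as the 'release' parameter
else:
    print("Already up to date.")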

dannywong
Databricks Employee

That's awesome James!

mitchstares
New Contributor II

Thanks @GC-James, that JSON is a great resource. I will definitely be building that into our pipeline!