Databricks Community

dannywong · 09-30-2024

This is the second part of a two-part blog series on geospatial data processing on Databricks. In the first part, we covered ingesting and processing Overture Maps data on Databricks. In this second part, we will delve into a practical use case: dynamic segmentation.

Imagine driving down a winding road where the speed limit changes every few hundred metres, or navigating a city where the pavement conditions shift from smooth asphalt to bumpy gravel. This dynamic nature of our road networks presents a fascinating challenge for location intelligence. Dynamic segmentation is a powerful technique that allows us to slice and dice linear features based on varying attributes.

For instance, city planners might use dynamic segmentation to determine whether sections with higher curbs experience fewer pedestrian-related accidents compared to those with lower curbs, or how accident rates change in areas where the speed limit fluctuates. This granular approach to road network analysis enables city planners and traffic engineers to identify high-risk zones and implement targeted safety measures, potentially saving lives and reducing injuries.

While this method can be applied in a multitude of scenarios, from environmental monitoring to urban planning, this blog post will zoom in on road networks as a captivating use case.

Understanding Dynamic Segmentation

Dynamic segmentation is the process of dividing linear features into segments based on changing attributes along their length. This technique is particularly useful for analysing and visualising how properties like speed limits, pavement conditions, or traffic volumes vary along a road network.

Dynamic segmentation creates variable-length segments that accurately represent changes in attributes. This approach provides a more precise representation of real-world conditions and enables more nuanced analysis.
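To make the idea concrete, here is a minimal Python sketch of the overlay step behind dynamic segmentation. It assumes two hypothetical event tables for a single route (speed limit and surface type), each a list of (start, end, value) records measured in metres along the route. The function name overlay_events and the sample values are illustrative only and do not come from Sedona or from any real dataset.

```python
from typing import List, Optional, Tuple

# (start_m, end_m, value): an attribute that holds over one stretch of a route
Event = Tuple[float, float, str]


def overlay_events(a: List[Event], b: List[Event]) -> List[Tuple[float, float, str, str]]:
    """Split the route wherever either attribute changes value."""
    # Every start or end measure from either table is a potential break point.
    breaks = sorted({m for s, e, _ in a + b for m in (s, e)})
    segments = []
    for start, end in zip(breaks, breaks[1:]):
        mid = (start + end) / 2.0
        # Look up which record of each table covers the middle of this piece.
        va: Optional[str] = next((v for s, e, v in a if s <= mid < e), None)
        vb: Optional[str] = next((v for s, e, v in b if s <= mid < e), None)
        if va is not None and vb is not None:
            segments.append((start, end, va, vb))
    return segments


# Hypothetical sample data for one 1,000 m route
speed_limit = [(0.0, 400.0, "50"), (400.0, 1000.0, "80")]
surface = [(0.0, 650.0, "asphalt"), (650.0, 1000.0, "gravel")]

segments = overlay_events(speed_limit, surface)
for seg in segments:
    print(seg)
```

The result has three variable-length segments (0 to 400 m, 400 to 650 m, 650 to 1,000 m), each carrying one speed limit and one surface type. At this stage the segments are only pairs of measures; turning them into geometries is the job of a function such as ST_LineSubstring.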

Apache Sedona and Databricks: A Powerful Combination

The Databricks Lakehouse platform offers a powerful and flexible environment for processing geospatial data at scale, both through built-in product features and through third-party libraries. One popular library among many is Apache Sedona, an Apache Spark-based framework for geospatial data processing. Sedona provides functions that are well suited to this dynamic segmentation use case and can augment the platform's built-in capabilities.

Databricks enhances geospatial workloads with innovative features like Liquid Clustering, which simplifies data layout and improves query performance; 30+ native H3 global gridding functions, enabling highly scalable discrete spatial analytics; and 60+ Spatial SQL functions, currently in private preview for DBR 14.3+ (reach out to your Databricks sales team to join the preview).

Users can easily install Apache Sedona on their Databricks clusters by following straightforward instructions. This extensibility, combined with the platform's distributed processing power and performance optimizations, positions Databricks as an ideal choice for organizations dealing with large-scale geospatial analytics and dynamic segmentation tasks.

Implementing Dynamic Segmentation

Let's explore how to perform dynamic segmentation using Apache Sedona on Databricks. We'll use a road network dataset and segment it based on different attributes.

Example 1: Segmenting Roads by Pavement Condition

In this example, we'll segment a road network (Rte) based on pavement condition (PC):

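-- One output row per pavement-condition record, with the route geometry
-- clipped to that record's chainage range
SELECT     PC.Classified_Road_Number
         , PC.Direction
         , PC.route
         , PC.Surface_Type
         , PC.Roughness_Category
         , PC.Start_Chainage_m
         , PC.End_Chainage_m
         -- ST_LineSubstring takes start and end as fractions (0 to 1) of the line's length
         , ST_LineSubstring(  Rte.geometry
                            , PC.Start_Chainage_m / Rte.ARCLENGTH
                            , PC.End_Chainage_m / Rte.ARCLENGTH
                            ) AS geometry
FROM       pavement_condition PC
JOIN       routes Rte
  ON       PC.route = Rte.ROUTE_ID
WHERE      PC.Classified_Road_Number = 1234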

In transportation planning or traffic analysis, Sedona’s ST_LineSubstring can be used to extract a particular segment of a road or path for detailed study, such as a stretch of road where frequent accidents occur.

Rte.geometry is the full geometry of the route.
PC.Start_Chainage_m/Rte.ARCLENGTH calculates the start point of the substring as a fraction of the total route length.
PC.End_Chainage_m/Rte.ARCLENGTH calculates the end point of the substring as a fraction of the total route length.
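The fraction arithmetic can be checked by hand with a small pure-Python sketch of what a line-substring operation computes. This is not Sedona's implementation. It assumes planar coordinates in metres and a start fraction no greater than the end fraction, and the helper names point_at and line_substring are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def point_at(line: List[Point], dist: float) -> Point:
    """Return the point `dist` units along the polyline, interpolating within a segment."""
    walked = 0.0
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        seg = math.hypot(x2 - x1, y2 - y1)
        if seg > 0 and walked + seg >= dist:
            t = (dist - walked) / seg
            return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
        walked += seg
    return line[-1]


def line_substring(line: List[Point], start_frac: float, end_frac: float) -> List[Point]:
    """Clip the polyline to the part between two fractions of its total length."""
    total = sum(math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(line, line[1:]))
    d0, d1 = start_frac * total, end_frac * total
    out = [point_at(line, d0)]
    walked = 0.0
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        walked += math.hypot(x2 - x1, y2 - y1)
        if d0 < walked < d1:
            out.append((x2, y2))  # keep original vertices strictly inside the range
    out.append(point_at(line, d1))
    return out


# Hypothetical 200 m route with one right-angle bend
route = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
arclength = 200.0
start_chainage_m, end_chainage_m = 50.0, 150.0

segment = line_substring(route, start_chainage_m / arclength, end_chainage_m / arclength)
print(segment)
```

Here a chainage range of 50 m to 150 m on a 200 m route becomes the fractions 0.25 and 0.75, and the clipped segment keeps the bend at (100, 0) between its two interpolated end points.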

For large LINESTRING geometries (e.g., routes), ST_LineSubstring can be used to create smaller, more manageable segments (e.g., pavement condition sections) for analysis. This is particularly useful when working with large datasets or when only a specific section of the data is relevant.

Example 2: Finding measure value along a route

This example calculates the distance along a specific bus route (RTE) to each bus stop (BS) on that route.

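SELECT     RTE.ROUTE_ID
         , BS.Location_Description
         -- fraction along the route (0 to 1) multiplied by route length gives a distance
         , ST_LineLocatePoint(RTE.geometry, BS.geometry) * RTE.ARCLENGTH AS measure
-- No join predicate: every row of route_bus_stops is measured against route 1234,
-- so this assumes the table holds only that route's stops.
FROM       route_bus_stops BS
CROSS JOIN routes RTE
WHERE      RTE.ROUTE_ID = 1234
ORDER BY   measure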

For managing transportation assets such as bus stops, signage, and maintenance points, Sedona’s ST_LineLocatePoint can help pinpoint their exact locations on the road network. This aids in asset inventory management, maintenance scheduling, and optimising the placement of new assets.

ST_LineLocatePoint(RTE.geometry, BS.geometry) returns a fraction between 0 and 1, representing where the bus stop point is located along the route line.
This fraction is then multiplied by ARCLENGTH (the total length of the route) to convert it into an actual distance measure.
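As a rough illustration of the calculation, the following pure-Python sketch projects a point onto each segment of a polyline, keeps the closest projection, and returns its position as a fraction of the total length. It is not Sedona's code. It assumes planar coordinates in metres, and the function name line_locate_point and the sample bus stop are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def line_locate_point(line: List[Point], p: Point) -> float:
    """Fraction (0 to 1) of the line's length at the point on the line closest to p."""
    best_dist, best_measure, walked = math.inf, 0.0, 0.0
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_sq = dx * dx + dy * dy
        if seg_sq == 0:
            continue  # skip zero-length segments
        # Project p onto the segment and clamp the result to its end points.
        t = max(0.0, min(1.0, ((p[0] - x1) * dx + (p[1] - y1) * dy) / seg_sq))
        d = math.hypot(p[0] - (x1 + t * dx), p[1] - (y1 + t * dy))
        seg = math.sqrt(seg_sq)
        if d < best_dist:
            best_dist, best_measure = d, walked + t * seg
        walked += seg
    return best_measure / walked if walked else 0.0


# Hypothetical 200 m route and a bus stop 5 m off the second leg
route = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
arclength = 200.0
bus_stop = (105.0, 50.0)

fraction = line_locate_point(route, bus_stop)
measure = fraction * arclength
print(fraction, measure)
```

The stop sits beside the midpoint of the second leg, so the fraction is 0.75 and the measure is 150 m. A stop that lies off the line is snapped to its nearest point on the route before the measure is taken.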

This is a useful application of spatial analysis in transportation planning. It can help in:

Visualising the distribution of bus stops along a route
Calculating distances between consecutive stops
Analysing the coverage and accessibility of public transport along the route
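Once each stop has a measure, the spacing between consecutive stops is a simple difference. The sketch below assumes the query result has been collected as (description, measure) pairs. The stop names, measures, and the function name stop_spacing are made up for illustration.

```python
from typing import List, Tuple


def stop_spacing(stops: List[Tuple[str, float]]) -> List[Tuple[str, str, float]]:
    """Return (from_stop, to_stop, gap_in_metres) for consecutive stops along the route."""
    ordered = sorted(stops, key=lambda s: s[1])  # order by measure, like ORDER BY measure
    return [(a, b, mb - ma) for (a, ma), (b, mb) in zip(ordered, ordered[1:])]


# Hypothetical stops with their measures in metres
stops = [("Library", 820.0), ("Depot", 0.0), ("Market", 350.0)]

gaps = stop_spacing(stops)
for gap in gaps:
    print(gap)
```

Unusually large gaps in this list are candidates for a coverage review, while very small ones may point to redundant stops.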

Benefits of Dynamic Segmentation on Databricks

Scalability: Efficiently process large road networks with the distributed computing environment, while Delta Lake ensures high performance, transactional reliability, and governance.
Flexibility: Easily adapt dynamic segmentation to various attributes or conditions, with Liquid Clustering enabling data layout evolution without rewrites.
Integration: Results can be seamlessly integrated with other data analysis workflows on the Databricks Platform to combine the power of spatial and aspatial data on a single unified platform.
Visualisation: Visualise segmented data using popular GIS tools or directly within Databricks Notebooks using open source libraries like kepler.gl.

Conclusion

Dynamic segmentation opens up a world of possibilities in location intelligence, transforming how we understand and interact with our road networks. By leveraging Apache Sedona on Databricks, we can slice through complex data to reveal insights about speed limits, pavement conditions, and bus stop locations. Buckle up and get ready to uncover the hidden narratives in your datasets!

If you haven't already, make sure to check out the first part of our series, where we discussed the foundational steps of processing Overture Maps data on Databricks. Together, the two parts of this series give you a couple of practical examples of running geospatial workloads on Databricks.

Dynamic Segmentation in Geospatial Analytics on Databricks - Part 2
