Low worker utilisation in Spatial SQL

james_ — Thu, 28 Aug 2025 05:38:39 GMT

I am seeing low worker node utilization when using Spatial SQL features. My cluster runs DBR 17.1 with 2 workers and Photon enabled.

The cluster metrics consistently show one worker around 30-50% utilized, the driver around 15-20%, and the second worker ~10%. My code reads a Delta table containing a WKT representation of the following ABS shapefile: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA1_2021_AUST_SHP_GDA2020.zip

I've tried repartitioning without success.

Am I doing something wrong?

(Variables: geom_poly_col tells my notebook that the WKT is in a column named 'geometry', and ls_cols_select selects a subset of columns from the Delta table.)

Code:

from pyspark.sql import functions as fn

# Create table of shapefile to H3 lookup with WKT geometries (using 'cover' method)
sdf = (
    spark.table(source_table)
    # h3_coverash3 returns an array of H3 cell IDs covering each polygon
    .selectExpr("*", f"h3_coverash3({geom_poly_col}, {h3_zoom_level}) AS h3_cell_id")
    .withColumn("h3_zoom", fn.lit(h3_zoom_level).cast("int"))
    # One output row per covering cell
    .withColumn("h3_cell_id", fn.explode(fn.col("h3_cell_id")))
    .withColumn("h3_polygon", fn.expr("h3_boundaryaswkt(h3_cell_id)"))
)

# Keep only the requested columns plus the derived H3 columns
keep_cols = ls_cols_select + [geom_poly_col, "h3_zoom", "h3_cell_id", "h3_polygon"]
sdf = sdf.select(*[c for c in sdf.columns if c in keep_cols])

# Write to Silver
hive_target_lk_table = f"silver.{target_table}_lookup"
sdf.writeTo(hive_target_lk_table).createOrReplace()

Re: Low worker utilisation in Spatial SQL

-werners- — Thu, 28 Aug 2025 08:52:16 GMT

how many partitions do you have?
is the data significantly skewed?
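
Both questions can be checked from a notebook. A minimal sketch, assuming `sdf` is the exploded DataFrame from the first post; `partition_row_counts` and `skew_ratio` are hypothetical helper names, not part of any library:

```python
def partition_row_counts(df):
    """Rows per non-empty Spark partition, as a list of ints."""
    # Imported lazily so skew_ratio below can be used without a Spark runtime.
    from pyspark.sql import functions as fn

    rows = df.groupBy(fn.spark_partition_id().alias("pid")).count().collect()
    return [row["count"] for row in rows]


def skew_ratio(counts):
    """Largest partition size divided by the mean partition size."""
    if not counts:
        return 0.0
    return max(counts) / (sum(counts) / len(counts))


# Usage in a notebook (assumes `sdf` exists):
#   print(sdf.rdd.getNumPartitions())  # .rdd may be unavailable in some cluster access modes
#   print(skew_ratio(partition_row_counts(sdf)))
```

A ratio near 1 means the partitions are roughly even. A large ratio means a few tasks carry most of the rows, which would be consistent with one busy worker and one mostly idle one.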

Re: Low worker utilisation in Spatial SQL

james_ — Thu, 28 Aug 2025 10:33:36 GMT

Thank you for your reply, @-werners-. It turns out that partitioning was the issue: I changed it from ~2,500 to ~61,000 partitions (I think!) and it wrote in about half an hour. The partitions are very skewed. I haven't found a neat way to partition spatial data (other than using any built-in hierarchies) and am open to suggestions.

Re: Low worker utilisation in Spatial SQL

james_ — Thu, 28 Aug 2025 23:07:47 GMT

In case anyone else stumbles here, I think I had my partitioning the wrong way around above: going from more partitions to fewer fixed the issue.
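
For anyone wanting to try the same thing, a rough sketch of lowering the partition count before the write is below. The helper names, the `total_cores` argument, and the default of 3 tasks per core are assumptions for illustration (the Spark tuning guide suggests 2-3 tasks per CPU core as a general starting point), not the exact change made here:

```python
def target_partitions(total_cores, tasks_per_core=3):
    """Partition count sized to the cluster rather than to the data."""
    return max(1, total_cores * tasks_per_core)


def repartition_for_write(df, total_cores):
    """Shuffle rows evenly into a cluster-sized number of partitions."""
    # repartition(n) with no columns redistributes rows round-robin,
    # so it also evens out skew introduced by the explode step.
    return df.repartition(target_partitions(total_cores))


# Usage in a notebook (assumes `sdf` exists; 16 is a made-up core count):
#   sdf = repartition_for_write(sdf, total_cores=16)
```

Treat the result as a starting point and adjust it against the task durations in the Spark UI; very large tables usually need more partitions than this gives.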

Re: Low worker utilisation in Spatial SQL

-werners- — Fri, 29 Aug 2025 13:38:55 GMT

I was just gonna ask how 61K partitions made things better 🙂

To get less skew, you could experiment with some feature engineering (combining existing features into a key that gives less skew), or force larger files that are not based on file content.
But with the latter you won't be able to apply partition pruning when reading.
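
One possible reading of the feature-engineering idea, as a sketch only: combine a coarser H3 parent cell with a hash-based salt so that dense areas are split across several keys. `salted_partition_key_expr`, `part_key`, and the resolution and salt values are hypothetical; `h3_toparent`, `pmod`, `hash` and `concat_ws` are existing Databricks SQL functions.

```python
def salted_partition_key_expr(parent_res, n_salts, cell_col="h3_cell_id"):
    """Spark SQL expression: coarse H3 parent cell plus a hash-based salt."""
    if n_salts < 1:
        raise ValueError("n_salts must be at least 1")
    # parent_res must not be finer than the resolution of the cells in cell_col.
    parent = f"CAST(h3_toparent({cell_col}, {parent_res}) AS STRING)"
    # The salt spreads the rows of one busy parent cell over n_salts buckets.
    salt = f"CAST(pmod(hash({cell_col}), {n_salts}) AS STRING)"
    return f"concat_ws('_', {parent}, {salt})"


# Usage in a notebook (assumes `sdf` and `fn` from the first post):
#   sdf = sdf.withColumn("part_key", fn.expr(salted_partition_key_expr(4, 8)))
#   sdf = sdf.repartition("part_key")
```

The parent cell keeps nearby rows together, while the salt is not something you would normally filter on, so the more salting you add, the less the key helps with pruning.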

Re: Low worker utilisation in Spatial SQL

james_ — Wed, 03 Sep 2025 01:18:23 GMT

Thank you again, @-werners- . I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!