
Low worker utilisation in Spatial SQL

james_
New Contributor II

I am seeing low worker node utilisation when using Spatial SQL features. My cluster is on DBR 17.1 with two workers and Photon enabled.

When I view the cluster metrics, they consistently show one worker at around 30-50% utilisation, the driver at 15-20%, and the second worker at ~10%. My code reads a Delta table holding WKT representations of the following ABS shapefile: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3...

I've tried repartitioning without success (a sketch of what I tried is below the code).

Am I doing something wrong?

(Variables: geom_poly_col tells my notebook that the WKT is in a column named 'geometry', and ls_cols_select selects a subset of columns from the Delta table.)

Code:

# Assumes source_table, h3_zoom_level, geom_poly_col, ls_cols_select
# and target_table are defined earlier in the notebook.
from pyspark.sql import functions as fn

# Create shapefile-to-H3 lookup table with WKT geometries (using the 'cover' method)
sdf = spark.table(source_table) \
    .selectExpr("*", f"h3_coverash3({geom_poly_col}, {h3_zoom_level}) AS h3_cell_id") \
    .withColumn("h3_zoom", fn.lit(h3_zoom_level).cast("int")) \
    .withColumn("h3_cell_id", fn.explode(fn.col("h3_cell_id"))) \
    .withColumn("h3_polygon", fn.expr("h3_boundaryaswkt(h3_cell_id)"))
sdf = sdf \
    .select(*[c for c in sdf.columns if c in (ls_cols_select + [geom_poly_col, "h3_zoom", "h3_cell_id", "h3_polygon"])])

# Write to Silver
hive_target_lk_table = f"silver.{target_table}_lookup"
sdf.writeTo(hive_target_lk_table).createOrReplace()
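
For completeness, this is roughly how I tried repartitioning before the write (a sketch; the count of 200 is just illustrative, not a value I settled on):

# Check how many partitions the DataFrame currently has
print(sdf.rdd.getNumPartitions())

# What I tried (without success): an explicit repartition before writing
sdf = sdf.repartition(200)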
5 REPLIES

-werners-
Esteemed Contributor III

How many partitions do you have?
Is the data significantly skewed?
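
A quick way to check both (a rough sketch, assuming sdf is the DataFrame from your snippet):

from pyspark.sql import functions as fn

# Number of partitions the DataFrame currently has
print(sdf.rdd.getNumPartitions())

# Rows per partition - a wide spread here means significant skew
sdf.groupBy(fn.spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy(fn.desc("count")) \
    .show(20)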

james_
New Contributor II

Thank you for your reply, @-werners- . It turns out partitioning was the issue: I changed it from ~2,500 to ~61,000 partitions (I think!) and the write finished in about half an hour. The partitions are very skewed; I haven't found a neat way to partition spatial data (other than using its built-in hierarchies) and am open to suggestions.

james_
New Contributor II

In case anyone else stumbles on this thread: I think I had my partitioning the wrong way around above - going from more partitions to fewer is what fixed the issue.
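
Concretely, it was something like this before the write (a sketch - I don't remember the exact count I landed on):

# Reduce the partition count before writing; coalesce merges partitions
# without a full shuffle, whereas repartition(n) would force one
sdf = sdf.coalesce(2500)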

-werners-
Esteemed Contributor III

I was just gonna ask how 61K partitions made things better 🙂

To reduce skew, you could experiment with some feature engineering (combining existing features into a key with a more even distribution), or force larger files independent of the file content.
But with the latter you won't be able to apply partition pruning when reading.
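
As one example of the feature-engineering route, you could derive a coarser key from the H3 hierarchy and repartition on that (a sketch - h3_toparent is the built-in parent-cell function, and the resolution of 4 is just a guess to tune):

# Derive a coarser parent cell for each exploded H3 cell, then repartition
# on it so nearby cells group together with a more even row count
sdf = sdf \
    .withColumn("h3_parent", fn.expr("h3_toparent(h3_cell_id, 4)")) \
    .repartition("h3_parent")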

james_
New Contributor II

Thank you again, @-werners- . I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!
