I am seeing low worker node utilization when using Spatial SQL features. My cluster runs DBR 17.1 with 2 workers and Photon enabled.
When I view the cluster metrics, they consistently show one worker at around 30-50% utilization, the driver at 15-20%, and the second worker at ~10%. My code reads a Delta table containing WKT representations of the geometries from the following ABS shapefile: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3...
I've tried repartitioning without success (sketch after the code below).
Am I doing something wrong?
(Variables: geom_poly_col tells my notebook that the WKT is in a column named 'geometry', and ls_cols_select selects a subset of columns from the Delta table.)
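For context, the notebook variables look roughly like this (everything except geom_poly_col is illustrative):

source_table = "bronze.abs_boundaries_wkt"       # illustrative table name
geom_poly_col = "geometry"                       # WKT polygon column
h3_zoom_level = 9                                # illustrative H3 resolution
ls_cols_select = ["region_code", "region_name"]  # illustrative column subset
target_table = "abs_boundaries_h3"               # illustrative name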
Code:
# Create shapefile-to-H3 lookup table with WKT geometries (using the 'cover' method)
from pyspark.sql import functions as fn

sdf = (
    spark.table(source_table)
    # h3_coverash3 returns the array of H3 cells that cover each polygon
    .selectExpr("*", f"h3_coverash3({geom_poly_col}, {h3_zoom_level}) AS h3_cell_id")
    .withColumn("h3_zoom", fn.lit(h3_zoom_level).cast("int"))
    # explode to one row per (polygon, H3 cell) pair
    .withColumn("h3_cell_id", fn.explode(fn.col("h3_cell_id")))
    .withColumn("h3_polygon", fn.expr("h3_boundaryaswkt(h3_cell_id)"))
)

# Keep only the requested columns plus the H3 lookup columns
keep_cols = set(ls_cols_select + [geom_poly_col, "h3_zoom", "h3_cell_id", "h3_polygon"])
sdf = sdf.select(*[c for c in sdf.columns if c in keep_cols])
# Write to Silver
hive_target_lk_table = f"silver.{target_table}_lookup"
sdf.writeTo(hive_target_lk_table).createOrReplace()
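For reference, the repartitioning I tried was roughly this (the partition count is illustrative), applied before the cover step; it didn't change the utilization pattern:

# Spread the source rows across more partitions before the expensive
# h3_coverash3 / explode step, so both workers get a share of the input
sdf = (
    spark.table(source_table)
    .repartition(64)  # illustrative; also tried spark.sparkContext.defaultParallelism
    .selectExpr("*", f"h3_coverash3({geom_poly_col}, {h3_zoom_level}) AS h3_cell_id")
)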