
Low worker utilisation in Spatial SQL

james_
New Contributor II

I am seeing low worker node utilisation when using Spatial SQL features. My cluster is on DBR 17.1 with two workers and Photon enabled.

When I view the cluster metrics, they consistently show one worker at around 30-50% utilisation, the driver at 15-20%, and the second worker at ~10%. My code reads a Delta table holding WKT representations of the following ABS shapefile: https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3...

I've tried repartitioning without success (a sketch of what I tried is below the code).

Am I doing something wrong?

(Variables: geom_poly_col tells my notebook that the WKT is in a column named 'geometry', and ls_cols_select selects a subset of columns from the Delta table.)

Code:

# Assumes source_table, h3_zoom_level, geom_poly_col, ls_cols_select
# and target_table are defined earlier in the notebook.
from pyspark.sql import functions as fn

# Create shapefile-to-H3 lookup table with WKT geometries (using the 'cover' method)
sdf = spark.table(source_table) \
    .selectExpr("*", f"h3_coverash3({geom_poly_col}, {h3_zoom_level}) AS h3_cell_id") \
    .withColumn("h3_zoom", fn.lit(h3_zoom_level).cast("int")) \
    .withColumn("h3_cell_id", fn.explode(fn.col("h3_cell_id"))) \
    .withColumn("h3_polygon", fn.expr("h3_boundaryaswkt(h3_cell_id)"))
sdf = sdf \
    .select(*[c for c in sdf.columns if c in (ls_cols_select + [geom_poly_col, "h3_zoom", "h3_cell_id", "h3_polygon"])])

# Write to Silver
hive_target_lk_table = f"silver.{target_table}_lookup"
sdf.writeTo(hive_target_lk_table).createOrReplace()
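
For completeness, this is roughly how I tried repartitioning before the write (a sketch; the count of 200 is just illustrative, not a value I settled on):

# Check how many partitions the DataFrame currently has
print(sdf.rdd.getNumPartitions())

# What I tried (without success): an explicit repartition before writing
sdf = sdf.repartition(200)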
5 REPLIES

-werners-
Esteemed Contributor III

How many partitions do you have?
Is the data significantly skewed?
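
A quick way to check both (a rough sketch, assuming sdf is the DataFrame from your snippet):

from pyspark.sql import functions as fn

# Number of partitions the DataFrame currently has
print(sdf.rdd.getNumPartitions())

# Rows per partition - a wide spread here means significant skew
sdf.groupBy(fn.spark_partition_id().alias("partition_id")) \
    .count() \
    .orderBy(fn.desc("count")) \
    .show(20)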

james_
New Contributor II

Thank you for your reply, @-werners- . It turns out partitioning was the issue: I changed it from ~2,500 to ~61,000 partitions (I think!) and the write finished in about half an hour. The partitions are very skewed; I haven't found a neat way to partition spatial data (other than using its built-in hierarchies) and am open to suggestions.

james_
New Contributor II

In case anyone else stumbles on this thread: I think I had my partitioning the wrong way around above - going from more partitions to fewer is what fixed the issue.
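
Concretely, it was something like this before the write (a sketch - I don't remember the exact count I landed on):

# Reduce the partition count before writing; coalesce merges partitions
# without a full shuffle, whereas repartition(n) would force one
sdf = sdf.coalesce(2500)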

-werners-
Esteemed Contributor III

I was just gonna ask how 61K partitions made things better 🙂

To reduce skew, you could experiment with some feature engineering (combining existing features into a key with a more even distribution), or force larger files independent of the file content.
But with the latter you won't be able to apply partition pruning when reading.
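
As one example of the feature-engineering route, you could derive a coarser key from the H3 hierarchy and repartition on that (a sketch - h3_toparent is the built-in parent-cell function, and the resolution of 4 is just a guess to tune):

# Derive a coarser parent cell for each exploded H3 cell, then repartition
# on it so nearby cells group together with a more even row count
sdf = sdf \
    .withColumn("h3_parent", fn.expr("h3_toparent(h3_cell_id, 4)")) \
    .repartition("h3_parent")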

james_
New Contributor II

Thank you again, @-werners- . I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!
