<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Low worker utilisation in Spatial SQL in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130160#M48720</link>
    <description>&lt;P&gt;I was just gonna ask how 61K partitions made things better &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;To have less skew, you could experiment with some feature engineering (existing features combined that give less skew), or force larger files not based on file content.&lt;BR /&gt;But with the latter you won´t be able to apply partition pruning when reading.&lt;/P&gt;</description>
    <pubDate>Fri, 29 Aug 2025 13:38:55 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2025-08-29T13:38:55Z</dc:date>
    <item>
      <title>Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129960#M48654</link>
      <description>&lt;P&gt;I am finding low worker node utilization when using Spatial SQL features. My cluster is DBR 17.1 with 2x workers and photon enabled.&lt;/P&gt;&lt;P&gt;When I view the cluster metrics, they consistently show one worker around 30-50% utilized, the driver around 15-20%, and the second worker ~10%. My code is referencing a delta table with WKT representation of the following ABS shapefile:&amp;nbsp;&lt;A href="https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA1_2021_AUST_SHP_GDA2020.zip" target="_blank"&gt;https://www.abs.gov.au/statistics/standards/australian-statistical-geography-standard-asgs-edition-3/jul2021-jun2026/access-and-downloads/digital-boundary-files/SA1_2021_AUST_SHP_GDA2020.zip&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I've tried repartitioning without success.&lt;/P&gt;&lt;P&gt;Am I doing something wrong?&lt;/P&gt;&lt;P&gt;&lt;FONT size="2"&gt;(variables: &lt;FONT face="courier new,courier"&gt;geom_poly_col&lt;/FONT&gt; tells my notebook the WKT is in a column named 'geometry', and &lt;FONT face="courier new,courier"&gt;ls_cols_select&lt;/FONT&gt; is to select a subset of columns from the delta table.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;# Create table of shapefile to H3 lookup with WKT geometries (using 'cover' method)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;sdf &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; spark.&lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt;(source_table) \&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;selectExpr&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"*"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"h3_coverash3(&lt;/SPAN&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;SPAN&gt;geom_poly_col&lt;/SPAN&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;SPAN&gt;h3_zoom_level&lt;/SPAN&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;SPAN&gt;) AS h3_cell_id"&lt;/SPAN&gt;&lt;SPAN&gt;) \&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;withColumn&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"h3_zoom"&lt;/SPAN&gt;&lt;SPAN&gt;, fn.&lt;/SPAN&gt;&lt;SPAN&gt;lit&lt;/SPAN&gt;&lt;SPAN&gt;(h3_zoom_level).&lt;/SPAN&gt;&lt;SPAN&gt;cast&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"int"&lt;/SPAN&gt;&lt;SPAN&gt;)) \&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;withColumn&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"h3_cell_id"&lt;/SPAN&gt;&lt;SPAN&gt;, fn.&lt;/SPAN&gt;&lt;SPAN&gt;explode&lt;/SPAN&gt;&lt;SPAN&gt;(fn.&lt;/SPAN&gt;&lt;SPAN&gt;col&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"h3_cell_id"&lt;/SPAN&gt;&lt;SPAN&gt;))) \&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;withColumn&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"h3_polygon"&lt;/SPAN&gt;&lt;SPAN&gt;, fn.&lt;/SPAN&gt;&lt;SPAN&gt;expr&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"h3_boundaryaswkt(h3_cell_id)"&lt;/SPAN&gt;&lt;SPAN&gt;))&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;sdf &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; sdf \&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;select&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;*&lt;/SPAN&gt;&lt;SPAN&gt;[c &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; c &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; sdf.columns &lt;/SPAN&gt;&lt;SPAN&gt;if&lt;/SPAN&gt;&lt;SPAN&gt; c &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; (ls_cols_select &lt;/SPAN&gt;&lt;SPAN&gt;+&lt;/SPAN&gt;&lt;SPAN&gt; [geom_poly_col, &lt;/SPAN&gt;&lt;SPAN&gt;"h3_zoom"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"h3_cell_id"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"h3_polygon"&lt;/SPAN&gt;&lt;SPAN&gt;])])&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;# Write to Silver&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;hive_target_lk_table &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"silver.&lt;/SPAN&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;SPAN&gt;target_table&lt;/SPAN&gt;&lt;SPAN&gt;}&lt;/SPAN&gt;&lt;SPAN&gt;_lookup"&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;FONT face="courier new,courier" size="2" color="#0000FF"&gt;&lt;SPAN&gt;sdf.&lt;/SPAN&gt;&lt;SPAN&gt;writeTo&lt;/SPAN&gt;&lt;SPAN&gt;(hive_target_lk_table).&lt;/SPAN&gt;&lt;SPAN&gt;createOrReplace&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 28 Aug 2025 05:38:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129960#M48654</guid>
      <dc:creator>james_</dc:creator>
      <dc:date>2025-08-28T05:38:39Z</dc:date>
    </item>
    <item>
      <title>Re: Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129987#M48663</link>
      <description>&lt;P&gt;how many partitions do you have?&lt;BR /&gt;is the data significantly skewed?&lt;/P&gt;</description>
      <pubDate>Thu, 28 Aug 2025 08:52:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129987#M48663</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-08-28T08:52:16Z</dc:date>
    </item>
    <item>
      <title>Re: Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129999#M48666</link>
      <description>&lt;P&gt;Thank you for your reply,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;. It turns out that partitioning was the issue, I changed it from ~2,500 to ~61,000 partitions (I think!) and it wrote in about half an hour. The partitions are very skewed, I haven't found a neat way to partition spatial data (other than using any built-in hierarchies) and am open to suggestions.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Aug 2025 10:33:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/129999#M48666</guid>
      <dc:creator>james_</dc:creator>
      <dc:date>2025-08-28T10:33:36Z</dc:date>
    </item>
    <item>
      <title>Re: Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130106#M48697</link>
      <description>&lt;P&gt;In case anyone else stumbles here, I think I had my partitioning the wrong way around above - going from more partitions to less fixed the issue.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Aug 2025 23:07:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130106#M48697</guid>
      <dc:creator>james_</dc:creator>
      <dc:date>2025-08-28T23:07:47Z</dc:date>
    </item>
    <item>
      <title>Re: Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130160#M48720</link>
      <description>&lt;P&gt;I was just gonna ask how 61K partitions made things better &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;To have less skew, you could experiment with some feature engineering (existing features combined that give less skew), or force larger files not based on file content.&lt;BR /&gt;But with the latter you won´t be able to apply partition pruning when reading.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Aug 2025 13:38:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130160#M48720</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-08-29T13:38:55Z</dc:date>
    </item>
    <item>
      <title>Re: Low worker utilisation in Spatial SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130571#M48839</link>
      <description>&lt;P&gt;Thank you again,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;&amp;nbsp;. I have a lot still to learn about partitioning and managing spatial data. Perhaps I mainly need more patience!&lt;/P&gt;</description>
      <pubDate>Wed, 03 Sep 2025 01:18:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/low-worker-utilisation-in-spatial-sql/m-p/130571#M48839</guid>
      <dc:creator>james_</dc:creator>
      <dc:date>2025-09-03T01:18:23Z</dc:date>
    </item>
  </channel>
</rss>

