<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Struggle to parallelize UDF in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122062#M10189</link>
    <description>&lt;P&gt;I sort of fixed it myself. The screenshot above was incorrect for the shared compute.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_1-1750219005650.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17592i3274A89BDD21C113/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_1-1750219005650.png" alt="Dimitry_1-1750219005650.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The fix was to change the access mode.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_2-1750219043277.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17593i41A1420560369FAA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_2-1750219043277.png" alt="Dimitry_2-1750219043277.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_4-1750219253241.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17595iBC45093A779B33E0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_4-1750219253241.png" alt="Dimitry_4-1750219253241.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 18 Jun 2025 04:01:00 GMT</pubDate>
    <dc:creator>Dimitry</dc:creator>
    <dc:date>2025-06-18T04:01:00Z</dc:date>
    <item>
      <title>Struggle to parallelize UDF</title>
      <link>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122055#M10188</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I have two clusters that look identical, but one runs my UDF in parallel and the other does not.&lt;/P&gt;&lt;P&gt;The one that does is a personal cluster; the one that does not is shared.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd
from datetime import datetime
from time import sleep
import threading

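# Test function: called once per group. Sleeps for 1 second, then records
# the time and the native ID of the thread that processed the group.
def func(x: pd.DataFrame):
    sleep(1)
    return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()})

# 40 rows with ids 0..39, split across 8 partitions
sdf = spark.range(start=0, end=40, step=1, numPartitions=8)

now = datetime.now()
# One group per id, so func runs 40 times
sdf = sdf.groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")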
result = spark.createDataFrame(sdf.toPandas()) # trigger lazy evaluation
print((datetime.now() - now).total_seconds())
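
# Rough expectation for the elapsed time printed above (a sketch, not a measurement).
# Assumptions: each of the 40 groups costs about 1 second (the sleep), the groups are
# spread evenly over the available task slots, and Spark overhead is ignored.
# `expected_seconds` is a hypothetical helper for illustration, not part of any Spark API.
import math

def expected_seconds(groups, slots, seconds_per_group=1.0):
    # Each slot works through its share of the groups one after another.
    return math.ceil(groups / slots) * seconds_per_group

print(expected_seconds(40, 4))  # 4 slots, as observed on the personal cluster
print(expected_seconds(40, 1))  # a single slot, i.e. no parallelism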

display(result.groupBy("thread").count())&lt;/LI-CODE&gt;&lt;P&gt;The personal cluster spreads the work across 4 threads (matching the number of CPUs), but the shared one does not.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_0-1750216264118.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17587i6EC9AFB41F87D845/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_0-1750216264118.png" alt="Dimitry_0-1750216264118.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_1-1750216332766.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17588i4362C02D72B7C200/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_1-1750216332766.png" alt="Dimitry_1-1750216332766.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;This is the configuration of the personal and shared clusters. I don't understand what makes them behave differently.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_3-1750216642622.png" style="width: 636px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17590iD00ABA719CE6654E/image-dimensions/636x271?v=v2" width="636" height="271" role="button" title="Dimitry_3-1750216642622.png" alt="Dimitry_3-1750216642622.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Note that in the real code I'm using repartition to achieve the same effect, and it also works on the personal cluster but not on the shared one:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;_sqldf.repartition(max_number_of_threads, "batch_id").groupBy("batch_id").applyInPandas(..)&lt;/LI-CODE&gt;&lt;P&gt;Please help!&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 03:19:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122055#M10188</guid>
      <dc:creator>Dimitry</dc:creator>
      <dc:date>2025-06-18T03:19:48Z</dc:date>
    </item>
    <item>
      <title>Re: Struggle to parallelize UDF</title>
      <link>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122062#M10189</link>
      <description>&lt;P&gt;I sort of fixed it myself. The screenshot above was incorrect for the shared compute.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_1-1750219005650.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17592i3274A89BDD21C113/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_1-1750219005650.png" alt="Dimitry_1-1750219005650.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The fix was to change the access mode.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_2-1750219043277.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17593i41A1420560369FAA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_2-1750219043277.png" alt="Dimitry_2-1750219043277.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_4-1750219253241.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/17595iBC45093A779B33E0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_4-1750219253241.png" alt="Dimitry_4-1750219253241.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 04:01:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122062#M10189</guid>
      <dc:creator>Dimitry</dc:creator>
      <dc:date>2025-06-18T04:01:00Z</dc:date>
    </item>
    <item>
      <title>Re: Struggle to parallelize UDF</title>
      <link>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122063#M10190</link>
      <description>&lt;P&gt;As a side note, a "No isolation shared" cluster has no access to Unity Catalog, so table queries are not possible there.&lt;/P&gt;&lt;P&gt;I resorted to using personal compute assigned to a group.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 04:42:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/struggle-to-parallelize-udf/m-p/122063#M10190</guid>
      <dc:creator>Dimitry</dc:creator>
      <dc:date>2025-06-18T04:42:02Z</dc:date>
    </item>
  </channel>
</rss>

