<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pyspark Dataframes orderby only orders within partition when having multiple worker in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/107058#M42692</link>
    <description>&lt;P&gt;Okay, I do see a difference on DBR 13.3, but not on 15.4.&lt;/P&gt;
&lt;P&gt;Would you be able to use a higher DBR version for your tests?&lt;/P&gt;
&lt;P&gt;This suggests that the issue is resolved in newer DBR versions, possibly by some improvement made since then.&lt;/P&gt;</description>
    <pubDate>Sun, 26 Jan 2025 11:26:55 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-01-26T11:26:55Z</dc:date>
    <item>
      <title>Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64263#M32509</link>
      <description>&lt;P&gt;I came across a PySpark issue when sorting a DataFrame by a column. It seems that PySpark only orders the data within partitions when the cluster has multiple workers, even though it shouldn't.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import functions as F
import matplotlib.pyplot as plt
import numpy as np

num_rows = 1000000
num_cols = 300

# Create a DataFrame with 1 million rows and 300 columns of random data
columns = ["col_" + str(i) for i in range(num_cols)]
data = spark.range(0, num_rows)

# create id column first
data = data.repartition(1) # we need this here so monotonically_increasing_id gives all numbers sequentially
data = data.withColumn("test1", F.monotonically_increasing_id())
data = data.orderBy(F.rand())
data = data.repartition(10)

for col_name in columns:
    data = data.withColumn(col_name, F.rand())

# default sort, which produces the wrong order
data = data.orderBy("test1", ascending=False)
test2 = data.select(F.collect_list("test1")).first()[0]
plt.plot(test2, color="red")

# test sorting after repartitioning to 1 partition
data2 = data.repartition(1)
data2 = data2.orderBy("test1", ascending=False)
test3 = data2.select(F.collect_list("test1")).first()[0]
plt.plot(test3, color="red")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The first plot looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="dbxuser7354_0-1711014288660.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6738iB76CBB44D276823C/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="dbxuser7354_0-1711014288660.png" alt="dbxuser7354_0-1711014288660.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The second plot looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="dbxuser7354_1-1711014300462.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6739i00B7108A3EE246BA/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="dbxuser7354_1-1711014300462.png" alt="dbxuser7354_1-1711014300462.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The second plot (after repartitioning to 1 partition) shows the correct sorting.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is this a known issue? If so, is it a PySpark issue or a Databricks issue? With only one worker, both plots are correct.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Mar 2024 09:46:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64263#M32509</guid>
      <dc:creator>dbx-user7354</dc:creator>
      <dc:date>2024-03-21T09:46:09Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64284#M32520</link>
      <description>&lt;P&gt;Thanks for your quick answer. Where is it documented that orderBy() or sort() only sorts within each partition? The &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sort.html" target="_self"&gt;official doc&lt;/A&gt; does not mention this.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Mar 2024 12:23:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64284#M32520</guid>
      <dc:creator>dbx-user7354</dc:creator>
      <dc:date>2024-03-21T12:23:48Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64290#M32524</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry to ask again, but I am a bit confused by this.&lt;/P&gt;&lt;P&gt;I thought that PySpark's `orderBy()` and `sort()` perform a shuffle before sorting for exactly this reason. There is another method, `sortWithinPartitions()`, that skips the shuffle and sorts partition-wise. I am actually surprised that `sort()` also only works partition-wise. But then: why does it work on a single-node cluster with a partitioned DataFrame?&lt;/P&gt;</description>
      <pubDate>Thu, 21 Mar 2024 12:52:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/64290#M32524</guid>
      <dc:creator>MarkusFra</dc:creator>
      <dc:date>2024-03-21T12:52:25Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106015#M42349</link>
      <description>&lt;P&gt;The &lt;CODE&gt;orderBy&lt;/CODE&gt; function in PySpark is expected to perform a global sort, which involves shuffling the data across partitions to ensure that the entire DataFrame is sorted. This is different from &lt;CODE&gt;sortWithinPartitions&lt;/CODE&gt;, which only sorts data within each partition.&lt;/P&gt;
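&lt;P&gt;To make the distinction concrete, here is a minimal pure-Python sketch. It is not Spark code and does not claim to mirror Spark's internals: the helper names (sort_within_partitions, global_sort, flatten) are hypothetical, and the sample-based range boundaries are a simplifying assumption about how a shuffle-based global sort can work. It contrasts sorting each partition in place with first routing rows into value ranges and then sorting each range.&lt;/P&gt;

```python
import random
from bisect import bisect_right


def flatten(partitions):
    """Concatenate partitions in order, as a driver would when collecting rows."""
    return [row for part in partitions for row in part]


def sort_within_partitions(partitions):
    """Sort each partition on its own; no rows move between partitions."""
    return [sorted(part, reverse=True) for part in partitions]


def global_sort(partitions, num_out, rng):
    """Range-partition the rows (the shuffle), then sort each range locally."""
    rows = flatten(partitions)
    # Estimate range boundaries from a sample (a simplifying assumption).
    sample = sorted(rng.sample(rows, min(len(rows), 100)))
    bounds = [sample[i * len(sample) // num_out] for i in range(1, num_out)]
    # Shuffle: route every row to the bucket that covers its value range.
    buckets = [[] for _ in range(num_out)]
    for row in rows:
        buckets[bisect_right(bounds, row)].append(row)
    # Descending order: highest range first, each range sorted descending.
    return [sorted(bucket, reverse=True) for bucket in reversed(buckets)]


rng = random.Random(0)
ids = list(range(1000))
rng.shuffle(ids)
# Ten partitions holding the ids in random order.
partitions = [ids[i::10] for i in range(10)]

local_order = flatten(sort_within_partitions(partitions))
global_order = flatten(global_sort(partitions, 10, rng))

expected = sorted(ids, reverse=True)
print("partition-wise sort is globally ordered:", local_order == expected)
print("range-partitioned sort is globally ordered:", global_order == expected)
```

&lt;P&gt;In this sketch, only the variant that moves rows between partitions before sorting yields a single descending sequence when the partitions are read back in order. The partition-wise variant yields ten separate descending runs, which would plot as a sawtooth rather than one line.&lt;/P&gt;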
&lt;P&gt;Let me try your program and understand the results further.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 04:01:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106015#M42349</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-01-17T04:01:23Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106062#M42371</link>
      <description>&lt;P&gt;I see the same results for&amp;nbsp;&lt;SPAN&gt;orderBy&lt;/SPAN&gt; both before and after repartitioning.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 11:11:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106062#M42371</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-01-17T11:11:28Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106065#M42373</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/23233"&gt;@NandiniN&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Did you try with a multi-worker cluster? Which runtime and Spark version did you use?&lt;/P&gt;&lt;P&gt;It might be good to test with Runtime 13.3; then we would know whether it was fixed in the meantime.&lt;/P&gt;&lt;P&gt;I found this on StackOverflow. It seems someone had a similar problem: &lt;A href="https://stackoverflow.com/questions/55860388/pyspark-dataframe-orderby-partition-level-or-overall" target="_blank"&gt;https://stackoverflow.com/questions/55860388/pyspark-dataframe-orderby-partition-level-or-overall&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There is also a very old Hive bug ticket that was never resolved. I am not sure whether it is connected:&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/HIVE-10417" target="_blank"&gt;https://issues.apache.org/jira/browse/HIVE-10417&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 11:39:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/106065#M42373</guid>
      <dc:creator>NemesisMF</dc:creator>
      <dc:date>2025-01-17T11:39:10Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/107058#M42692</link>
      <description>&lt;P&gt;Okay, I do see a difference on DBR 13.3, but not on 15.4.&lt;/P&gt;
&lt;P&gt;Would you be able to use a higher DBR version for your tests?&lt;/P&gt;
&lt;P&gt;This suggests that the issue is resolved in newer DBR versions, possibly by some improvement made since then.&lt;/P&gt;</description>
      <pubDate>Sun, 26 Jan 2025 11:26:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/107058#M42692</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-01-26T11:26:55Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark Dataframes orderby only orders within partition when having multiple worker</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/107110#M42704</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/96791"&gt;@dbx-user7354&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;orderBy&lt;/STRONG&gt;() should perform a global sort, as shown in &lt;STRONG&gt;plot 2&lt;/STRONG&gt;, but in your case it is sorting the data only within partitions, which is the behavior of &lt;STRONG&gt;sortWithinPartitions&lt;/STRONG&gt;(). To resolve this, please try the latest DBR version and check the result again. I think the problem is on the Databricks Runtime side.&lt;/P&gt;&lt;P&gt;If this helps, please mark it as the solution.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Avinash N&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jan 2025 06:28:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframes-orderby-only-orders-within-partition-when/m-p/107110#M42704</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2025-01-27T06:28:10Z</dc:date>
    </item>
  </channel>
</rss>

