Pyspark Dataframes orderby only orders within partition when having multiple worker

dbx-user7354 — Thu, 21 Mar 2024 09:46:09 GMT

I came across a PySpark issue when sorting a DataFrame by a column. It seems that PySpark only orders the data within partitions when the cluster has multiple workers, even though it shouldn't.

    from pyspark.sql import functions as F
    import matplotlib.pyplot as plt
    import numpy as np

    num_rows = 1000000
    num_cols = 300

    # Create a DataFrame with 1 million rows and 300 columns of random data
    columns = ["col_" + str(i) for i in range(num_cols)]
    data = spark.range(0, num_rows)  # create id column first
    data = data.repartition(1)  # we need this here so monotonically_increasing_id gives all numbers sequentially
    data = data.withColumn("test1", F.monotonically_increasing_id())
    data = data.orderBy(F.rand())
    data = data.repartition(10)
    for col_name in columns:
        data = data.withColumn(col_name, F.rand())

    # default sorting which leads to wrong sorting
    data = data.orderBy("test1", ascending=False)
    test2 = data.select(F.collect_list("test1")).first()[0]
    plt.plot(test2, color="red")

    # test sorting after repartitioning to 1 partition
    data2 = data.repartition(1)
    data2 = data2.orderBy("test1", ascending=False)
    test3 = data2.select(F.collect_list("test1")).first()[0]
    plt.plot(test3, color="red")

The first plot looks like this:

The second plot looks like this:

The second plot (after repartitioning to 1 partition) shows the correct sorting.

Is this a known issue? If so, is it a PySpark issue or a Databricks issue? With only one worker, both plots are correct.

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

dbx-user7354 — Thu, 21 Mar 2024 12:23:48 GMT

Thanks for your quick answer. Where can I find the information that `orderBy()` or `sort()` only sorts within partitions? The official documentation does not mention this.

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

MarkusFra — Thu, 21 Mar 2024 12:52:25 GMT

@Retired_mod

Sorry if I have to ask again, but I am a bit confused by this.

I thought that PySpark's `orderBy()` and `sort()` do a shuffle before sorting for exactly this reason. There is another command, `sortWithinPartitions()`, that does not shuffle and sorts each partition separately. I am actually surprised that `sort()` also only works partition-wise. But then: why does it work on a single-node cluster with a partitioned DataFrame?

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

NandiniN — Fri, 17 Jan 2025 04:01:23 GMT

The orderBy function in PySpark is expected to perform a global sort, which involves shuffling the data across partitions to ensure that the entire DataFrame is sorted. This is different from sortWithinPartitions, which only sorts data within each partition.
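The difference between the two behaviors can be sketched in plain Python, without Spark. This is only a toy model: the helper names and the `boundaries` split points are made up for illustration, and real Spark picks its range boundaries by sampling the data. It shows why range-partitioning followed by a local sort gives a global order, while a local sort alone gives a sawtooth pattern when the partitions are read back to back.

```python
def sort_within_partitions(partitions):
    """Sort each partition on its own; no value moves to another partition."""
    return [sorted(p, reverse=True) for p in partitions]


def global_sort(partitions, boundaries):
    """Toy global descending sort: range-partition first, then sort locally.

    `boundaries` are hypothetical split points in descending order.
    Values above boundaries[0] land in bucket 0, and so on.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for partition in partitions:
        for value in partition:
            # Count how many boundaries the value does not exceed.
            idx = sum(value <= b for b in boundaries)
            buckets[idx].append(value)
    return [sorted(b, reverse=True) for b in buckets]


def flatten(partitions):
    """Read partitions back to back, in partition order."""
    return [v for p in partitions for v in p]


values = list(range(100))
# Four partitions with interleaved values: [0, 4, 8, ...], [1, 5, 9, ...], ...
partitions = [values[i::4] for i in range(4)]

local = flatten(sort_within_partitions(partitions))
glob = flatten(global_sort(partitions, boundaries=[75, 50, 25]))

print(local[:5])  # starts at 96 and restarts near the top for each partition
print(glob[:5])   # starts at 99 and decreases all the way down
```

In this model `glob` is fully descending, whereas `local` restarts near the maximum once per partition.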

Let me try your program and understand the results further.

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

NandiniN — Fri, 17 Jan 2025 11:11:28 GMT

Both before and after the repartition, I see the same results for `orderBy`.

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

NemesisMF — Fri, 17 Jan 2025 11:39:10 GMT

@NandiniN

Did you try with a multi-worker cluster? Which runtime and which Spark version did you use?

Maybe it would be good to test with Runtime 13.3; then we would know whether it has been fixed in the meantime.

I found this on StackOverflow. Seems someone had a similar problem: https://stackoverflow.com/questions/55860388/pyspark-dataframe-orderby-partition-level-or-overall

There is also a very old Hive bug ticket that was never resolved. Not sure if it is connected:

https://issues.apache.org/jira/browse/HIVE-10417

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

NandiniN — Sun, 26 Jan 2025 11:26:55 GMT

Okay, I do see a difference on 13.3, but not on 15.4.

For your tests, would you be able to use a higher DBR version?

This means that the issue is resolved in higher DBR versions, possibly as a result of some improvement.

Re: Pyspark Dataframes orderby only orders within partition when having multiple worker

Avinash_Narala — Mon, 27 Jan 2025 06:28:10 GMT

Hi @dbx-user7354 ,

`orderBy()` should perform a global sort, as shown in the second plot. In your case it appears to sort the data only within partitions, which is the behavior of `sortWithinPartitions()`. Please try the latest DBR version and check the result again. I think the problem is on the Databricks Runtime side.
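When re-checking the result, it may help to read the sorted rows back with `collect()` instead of `collect_list()`. The PySpark documentation notes that `collect_list` is non-deterministic, because the order of the collected values depends on row order, which may change after a shuffle. Whether that explains the plots in this thread has not been established here. The helper below is a minimal sketch with a hypothetical name; it assumes the column is small enough to collect to the driver.

```python
def is_globally_sorted_desc(df, column):
    """Return True if `column` is non-increasing when `df` is read in row order.

    Assumes `df` is an already sorted PySpark DataFrame and that the selected
    column fits in driver memory.
    """
    # collect() returns the rows partition by partition, in partition order.
    values = [row[0] for row in df.select(column).collect()]
    return all(a >= b for a, b in zip(values, values[1:]))
```

With the DataFrame from the original post, this could be called as `is_globally_sorted_desc(data, "test1")` right after `data = data.orderBy("test1", ascending=False)`.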

If this helps, mark it as the solution.

Regards,

Avinash N