I came across a PySpark issue when sorting a DataFrame by a column. With multiple workers, PySpark appears to order the data only within each partition, even though orderBy should produce a globally sorted result.
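For reference, my understanding is that orderBy is documented as a total sort across all partitions, while sortWithinPartitions only sorts inside each partition. A minimal sketch of that distinction (toy example, not my actual data; spark is the SparkSession the Databricks notebook provides):

df = spark.range(0, 100).repartition(4)
# orderBy: total sort, so collect() should return one globally descending sequence
ids_total = [r["id"] for r in df.orderBy("id", ascending=False).collect()]
# sortWithinPartitions: each partition is sorted, but no global order is guaranteed
ids_partial = [r["id"] for r in df.sortWithinPartitions("id", ascending=False).collect()]
print(ids_total == sorted(ids_total, reverse=True))      # expected: True
print(ids_partial == sorted(ids_partial, reverse=True))  # typically False with >1 partition

What I observe with multiple workers looks like the second behavior even though I call orderBy. Here is my full reproduction: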
from pyspark.sql import functions as F
import matplotlib.pyplot as plt
num_rows = 1000000
num_cols = 300
# Create a DataFrame with 1 million rows and 300 columns of random data
columns = ["col_" + str(i) for i in range(num_cols)]
data = spark.range(0, num_rows)
# create a sequential id column first
data = data.repartition(1)  # single partition so monotonically_increasing_id yields consecutive ids (0, 1, 2, ...)
data = data.withColumn("test1", F.monotonically_increasing_id())
data = data.orderBy(F.rand())  # shuffle the rows into random order
data = data.repartition(10)    # spread the shuffled rows over 10 partitions
for col_name in columns:
    data = data.withColumn(col_name, F.rand())
# default sort by the id column, which produces the wrong ordering
data = data.orderBy("test1", ascending=False)
test2 = data.select(F.collect_list("test1")).first()[0]
plt.plot(test2, color="red")
# test sorting after repartitioning to 1 partition
data2 = data.repartition(1)
data2 = data2.orderBy("test1", ascending=False)
test3 = data2.select(F.collect_list("test1")).first()[0]
plt.plot(test3, color="red")
The first plot (default sort across 10 partitions) shows an incorrect ordering, with the values apparently sorted only within each partition. The second plot (after repartitioning to 1 partition) shows the correct global sorting.
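A direct way to check this would be to tag each row with its partition id and compare the per-partition value ranges after the sort (a sketch, using the built-in spark_partition_id; the pid column name is just for illustration):

# sketch: inspect per-partition min/max of test1 after the sort
diag = data.withColumn("pid", F.spark_partition_id())
(diag.groupBy("pid")
     .agg(F.min("test1").alias("lo"), F.max("test1").alias("hi"))
     .orderBy("pid")
     .show())
# non-overlapping [lo, hi] ranges would indicate a genuine global sort;
# overlapping ranges would mean the data is only sorted within partitions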
Is this a known issue? If so, is it a PySpark issue or a Databricks issue? With only one worker, both plots show the correct ordering.
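For completeness, the same check without collect_list, in case the aggregation itself does not preserve row order (a sketch, assuming the single column fits in driver memory):

# sketch: fetch the sorted column via collect() instead of collect_list
vals = [r["test1"] for r in data.orderBy("test1", ascending=False).collect()]
plt.plot(vals, color="blue")
plt.show()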