Problem with sparkContext.parallelize and volatile functions?

del1000 — Mon, 29 May 2023 10:14:10 GMT

I have the following code:

from time import sleep
from random import random
from operator import add
 
def f(a: int) -> float:
    sleep(0.1)
    return random()
  
rdd1 = sc.parallelize(range(20), 2)
rdd2 = sc.parallelize(range(20), 2)
rdd3 = sc.parallelize(range(20), 2)
print('result a1:', rdd1.map(f).reduce(add))
print('result a2:', rdd2.map(f).reduce(add))
print('result a3:', rdd3.map(f).reduce(add))
print('result b1:', sum([f(a) for a in range(20)]))
print('result b2:', sum([f(a) for a in range(20)]))
print('result b3:', sum([f(a) for a in range(20)]))

Sample output:

result a1: 9.80073680418538
result a2: 9.80073680418538
result a3: 9.80073680418538
result b1: 9.219767385799257
result b2: 8.175800896981904
result b3: 9.417623482504323

Can anybody explain why the a* results all have the same value? I would expect every result line to be different from the others.

How can I change the code so that the a* results are different?

Tested using Runtime 10 and 12.
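
One thing that might be worth trying, as a sketch only: if the cause is that every Spark task starts from the same state of the global generator in the `random` module (an assumption, not something confirmed above), then creating a fresh generator inside each task should give different sums. The names `f_partition` and `random_sum` are hypothetical, and `sc` is assumed to be the existing SparkContext used in the code above.

```python
import random
from operator import add
from time import sleep


def f_partition(rows):
    # Create the generator inside the task. random.Random() with no seed
    # draws its seed from the operating system when possible, so it does
    # not reuse generator state that came from the driver process.
    rng = random.Random()
    for _ in rows:
        sleep(0.1)
        yield rng.random()


def random_sum(sc, n=20, partitions=2):
    # `sc` is assumed to be an existing SparkContext, as in the question.
    rdd = sc.parallelize(range(n), partitions)
    # mapPartitions calls f_partition once per partition, so each
    # partition gets its own freshly seeded generator.
    return rdd.mapPartitions(f_partition).reduce(add)
```

With this in place, `print('result a1:', random_sum(sc))` called three times would be expected to print three different values, provided the assumption about the shared generator state holds.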
