Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Problem with sparkContext.parallelize and volatile functions?

del1000
New Contributor III

I have this code:

from time import sleep
from random import random
from operator import add
 
def f(a: int) -> float:
    # the argument is ignored; each call just sleeps briefly and draws a random number
    sleep(0.1)
    return random()
  
rdd1 = sc.parallelize(range(20), 2)
rdd2 = sc.parallelize(range(20), 2)
rdd3 = sc.parallelize(range(20), 2)
print('result a1:', rdd1.map(f).reduce(add))
print('result a2:', rdd2.map(f).reduce(add))
print('result a3:', rdd3.map(f).reduce(add))
print('result b1:', sum([f(a) for a in range(20)]))
print('result b2:', sum([f(a) for a in range(20)]))
print('result b3:', sum([f(a) for a in range(20)]))

Sample output:

result a1: 9.80073680418538
result a2: 9.80073680418538
result a3: 9.80073680418538
result b1: 9.219767385799257
result b2: 8.175800896981904
result b3: 9.417623482504323

Can anybody explain why the a* results all have the same value? In my opinion, every result line should differ from the others.

How can I correct the code to make sure the a* results are different?

Tested on Databricks Runtime 10 and 12.

