I've run a combined multiprocessing and multithreading solution in Python before, using the multiprocessing and concurrent.futures modules. However, since the multiprocessing module only runs on the driver node, I have to use sc.parallelize instead to distribute the workload across the worker nodes. I'm having a bit of trouble with this and would appreciate input on troubleshooting (code below). Here, I'm just trying to split a list of 100 values across 10 worker nodes and run threading on each of those worker nodes.
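For reference, the earlier driver-only version looked roughly like this (a simplified sketch, not my exact code): multiprocessing.Pool fans the chunks out to processes, and each process runs its own ThreadPoolExecutor.

import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def square(x):
    time.sleep(1)
    return x**2

def threaded_chunk(chunk):
    # each process runs its own thread pool over its chunk
    with ThreadPoolExecutor() as executor:
        return list(executor.map(square, chunk))

if __name__ == "__main__":
    nums = list(range(100))
    chunks = [nums[i:i + 10] for i in range(0, len(nums), 10)]
    with Pool(processes=10) as pool:
        results = pool.map(threaded_chunk, chunks)

The Spark attempt below is my translation of that pattern: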
import time
from concurrent.futures import ThreadPoolExecutor
def multithread(task, l):
    # run 'task' over every element of 'l' using a thread pool
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(task, l))
    return results
def square(x):
    time.sleep(1)
    return x**2
def partition(l, n):
    # split the input list into chunks of size 'n' (here: 10 chunks of 10 values)
    for i in range(0, len(l), n):
        yield l[i:i + n]
num = list(range(100))
workers = 10
chunks = list(partition(num, workers))
rdd = sc.parallelize(chunks, numSlices=workers)
results = rdd.map(lambda x: multithread(square, x)).collect()
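In case it clarifies what I'm aiming for, my understanding is that the same per-partition threading could also be written with mapPartitions, so that Spark hands each partition's elements straight to a thread pool (again just a sketch, assuming sc is the active SparkContext):

import time
from concurrent.futures import ThreadPoolExecutor

def square(x):
    time.sleep(1)
    return x**2

def threaded_partition(iterator):
    # run a thread pool over every element that lands in this partition
    with ThreadPoolExecutor() as executor:
        return list(executor.map(square, iterator))

rdd = sc.parallelize(list(range(100)), numSlices=10)
results = rdd.mapPartitions(threaded_partition).collect()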