Run threadpool on multiple nodes

pjp94 — Wed, 05 Jul 2023 18:37:09 GMT

I've previously run a combined multiprocessing and multithreading solution in Python using the multiprocessing and concurrent.futures modules. However, since the multiprocessing module only runs on the driver node, I instead have to use sc.parallelize to distribute the workload to the worker nodes. I'm having a bit of trouble and would appreciate input on troubleshooting (code below). Here, I'm just trying to split a list of 100 values across 10 worker nodes and run a thread pool on each of those worker nodes.

from concurrent.futures import ThreadPoolExecutor

def multithread(task, l):
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(task, l))
    return results

def square(x):
    time.sleep(1)
    return x**2

def partition(l, n):
    # this function just partitions an input list into 'n' chunks
    for i in range(0, len(l), n):
        yield l[i:i +n]

num = list(range(100))
workers = 10
chunks = list(partition(num, workers))
rdd = sc.parallelize(chunks, numSlices=workers)
results = rdd.map(lambda x: multithread(x)).collect()
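As posted, the snippet never imports time, and the lambda calls multithread with a single argument although the function takes two (task and l). Also, partition(l, n) yields chunks of size n rather than n chunks; the two only coincide here because 100 / 10 = 10. A minimal sketch of a version that should run is below. It assumes sc is a live SparkContext (as in a notebook), and run_on_cluster is a hypothetical wrapper name introduced only for illustration, not part of the original code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def multithread(task, items):
    # Run `task` over `items` using a thread pool created inside the
    # calling process (on a Spark executor when invoked from rdd.map).
    with ThreadPoolExecutor() as executor:
        return list(executor.map(task, items))


def square(x):
    time.sleep(1)  # stand-in for slow, I/O-bound work
    return x ** 2


def partition(items, size):
    # Yield consecutive chunks of at most `size` elements each.
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_on_cluster(sc, values, workers=10):
    # Hypothetical helper: `sc` is assumed to be an existing SparkContext.
    # Ceiling division gives a chunk size that produces at most `workers` chunks.
    size = max(1, -(-len(values) // workers))
    chunks = list(partition(values, size))
    rdd = sc.parallelize(chunks, numSlices=workers)
    # Each RDD element is one chunk; the task is passed explicitly.
    return rdd.map(lambda chunk: multithread(square, chunk)).collect()
```

Calling run_on_cluster(sc, list(range(100)), workers=10) would return a list of 10 lists of squares. Note that numSlices only sets the number of partitions; how those partitions are spread across worker nodes depends on the cluster's executors and cores.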
