
Run threadpool on multiple nodes

pjp94
Contributor

I've run a combined multiprocessing and multithreading solution in Python before, using the multiprocessing and concurrent.futures modules. However, since the multiprocessing module only runs on the driver node, I have to use sc.parallelize instead to distribute the workload to the worker nodes. I'm having a bit of trouble and would appreciate input on troubleshooting (code below). Here, I'm just trying to split a list of 100 values across 10 worker nodes and run a thread pool on each of those worker nodes.

 

 

import time
from concurrent.futures import ThreadPoolExecutor

def multithread(task, l):
    # run 'task' over every element of 'l' in a thread pool
    with ThreadPoolExecutor() as executor:
        results = list(executor.map(task, l))
    return results

def square(x):
    time.sleep(1)
    return x**2

def partition(l, n):
    # this function partitions an input list into chunks of 'n' elements each
    for i in range(0, len(l), n):
        yield l[i:i + n]

num = list(range(100))
workers = 10
# 100 values in chunks of 10 gives 10 chunks, one per worker
chunks = list(partition(num, workers))
rdd = sc.parallelize(chunks, numSlices=workers)
# each RDD element is one chunk; pass the task as well as the chunk
results = rdd.map(lambda x: multithread(square, x)).collect()
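
For reference, here is a minimal sketch of the same idea that can be tried without a cluster first. It is only an assumption about how this could be laid out, not a tested Databricks recipe. The names split_into_chunks and run_on_cluster and the max_workers=8 default are hypothetical and were picked for illustration. Note that partition(l, n) above yields chunks of n elements, which gives 10 chunks here only because 100 / 10 = 10; split_into_chunks instead takes the number of chunks directly. The multithread helper needs both the task and the chunk, and run_on_cluster expects an existing SparkContext passed in as sc.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def square(x):
    time.sleep(0.01)  # stand-in for a slow, I/O-bound call
    return x ** 2


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first 'extra' chunks each take one additional element
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def multithread(task, items, max_workers=8):
    """Apply task to every item in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(task, items))


def run_on_cluster(sc, task, items, n_chunks):
    """Hypothetical helper: one chunk per partition, threads inside each task.

    Assumes 'sc' is a live SparkContext (as in a Databricks notebook).
    """
    rdd = sc.parallelize(split_into_chunks(items, n_chunks), numSlices=n_chunks)
    nested = rdd.map(lambda chunk: multithread(task, chunk)).collect()
    # flatten the per-chunk result lists back into a single list
    return [value for chunk in nested for value in chunk]
```

On a cluster this would be called as run_on_cluster(sc, square, list(range(100)), 10). The two local helpers can be checked on the driver alone before involving Spark.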

 

 

 
