cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Run threadpool on multiple nodes

pjp94
Contributor

I've ran a dual multiprocessing and multithreading solution in python before using the multiprocessing and concurrent futures python modules. However, since the multiprocessing module only runs on the driver node, I have to instead use sc.parallelize to distribute workload to the worker nodes. I'm having a bit of trouble and would appreciate input on troubleshooting (code below). Here, I'm just trying to split a list of 100 values across 10 worker nodes and implement threading on each of those worker nodes.

 

 

from concurrent.futures import ThreadPoolExecutor

def multithread(task, l):
   with ThreadPoolExecutor() as executor:
      results = list(executor.map(task, l))
   return results

def square(x):
   time.sleep(1)
   return x**2

def partition(l, n):
   # this function just partitions an input list into 'n' chunks
   for i in range(0, len(l), n):
      yield l[i:i +n]

num = list(range(100))
workers = 10
chunks = list(partition(num, workers))
rdd = sc.parallelize(chunks, numSlices=workers)
results = rdd.map(lambda x: multithread(x)).collect()

 

 

 

0 REPLIES 0

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group