Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Threads vs Processes (Parallel Programming) Databricks

Wolfoflag
New Contributor II

Hi Everyone,

I am trying to implement parallel processing in Databricks, and all the resources online point to using ThreadPool from Python's multiprocessing.pool module or the concurrent.futures library. These approaches create asynchronous threads, not actual parallel processes. I want to know how to use multiprocessing.Pool.map (or multiprocessing.Process) to initialize workers and run true parallel processes.

I would also like to know how this works under the hood in Databricks (creating and distributing processes rather than threads), considering its driver/worker node architecture.

Finally, is there a reason everyone creates threads rather than actual parallel processes in Databricks?

Thanks.

1 REPLY

Wojciech_BUK
Valued Contributor III

I am not a super expert, but I have been using Databricks for a while, and I can say that Python libraries like asyncio, ThreadPool and so on are only good for maintenance tasks, small API calls, etc.
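To make the thread/process distinction concrete, here is a minimal sketch using only the Python standard library. The helper names (square, run_in_threads, run_in_processes) are made up for illustration. Both pools start their workers on the machine that runs the code, which for a notebook is the driver, so neither one spreads work across the cluster's worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool


def square(n: int) -> int:
    """Hypothetical stand-in for a small task, e.g. one API call."""
    return n * n


def run_in_threads(items, max_workers=4):
    # Threads live inside one Python process and share its memory.
    # This suits I/O-bound work such as waiting on HTTP responses.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(square, items))


def run_in_processes(items, processes=4):
    # Each worker is a separate OS process with its own interpreter,
    # so the function and its arguments must be picklable.
    # All of these processes still run on the local machine only.
    with Pool(processes=processes) as pool:
        return pool.map(square, items)
```

Calling run_in_threads([1, 2, 3]) or run_in_processes([1, 2, 3]) returns the squared values in input order. Which one is faster depends on whether the task is I/O-bound or CPU-bound, so treat this as a starting point rather than a recommendation.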

When you want to leverage Spark to do work in parallel, you should let Spark handle the parallelism natively.
E.g.

  • for massively parallel API calls you can use a DataFrame and UDFs
  • to execute multiple Spark jobs (like transforming multiple tables) you can create one job with multiple tasks and reuse a shared job cluster (in the Workflows experience).
    They also recently added a for-each loop in Workflows.
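The first bullet (a DataFrame plus a UDF for many API calls) could look roughly like the sketch below. This is only an illustration under assumptions: call_api is a hypothetical placeholder that does no real network call, the column names item_id and response are invented, and spark is assumed to be an existing SparkSession such as the one a Databricks notebook provides.

```python
def call_api(item_id: int) -> str:
    """Hypothetical placeholder; a real version would call an HTTP API."""
    return f"result-{item_id}"


def call_api_with_spark(spark, item_ids, num_partitions=8):
    # Imported here so call_api can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Wrap the plain Python function as a Spark UDF returning a string.
    call_api_udf = udf(call_api, StringType())

    # One row per item; Spark splits the rows into partitions and the
    # executors on the worker nodes process those partitions in parallel.
    df = spark.createDataFrame([(i,) for i in item_ids], ["item_id"])
    df = df.repartition(num_partitions)
    return df.withColumn("response", call_api_udf("item_id"))
```

In a notebook this might be used as call_api_with_spark(spark, range(1000)).show(). The partition count is an assumed knob for how widely the calls are spread; a sensible value depends on the cluster size and on any rate limits of the API being called.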

Spark natively distributes DataFrame partitions to the workers, and if there are resources left, it can handle another table or set of partitions.

Can you describe what you are trying to achieve by using Python libraries for parallelism?
