Re: Threads vs Processes (Parallel Programming) Da...

Wojciech_BUK · ‎05-06-2024

I am no expert, but I have been using Databricks for a while, and in my experience Python libraries such as asyncio, ThreadPool and so on are only good for maintenance tasks, small API calls and the like.
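To illustrate the kind of small task I mean, here is a minimal sketch using only the Python standard library. The function `check_endpoint` and the endpoint names are made up for the example, and the network call is simulated with a short sleep.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def check_endpoint(name: str) -> str:
    """Hypothetical stand-in for a small I/O-bound call (e.g. a REST request)."""
    time.sleep(0.05)  # simulate network latency
    return f"{name}: ok"


endpoints = ["jobs", "clusters", "tables"]  # made-up names for illustration

# All threads run inside the driver's Python process; Spark workers are not involved.
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the inputs
    results = list(pool.map(check_endpoint, endpoints))

print(results)
```

This works fine for a handful of lightweight calls, but everything runs on the driver, so it does not scale out with the cluster.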

When you want to leverage Spark to do work in parallel, you should let Spark handle the parallelism natively.
For example:

- For massive parallel API calls, you can use a DataFrame and UDFs.
- To execute multiple Spark jobs (such as transforming multiple tables), you can create one job with multiple tasks and reuse a shared job cluster (in the Workflows experience).
- Or even use the for-each loop that was recently added to Workflows.
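The DataFrame and UDF option above could look roughly like the sketch below. It assumes PySpark is available and that an active session is passed in as `spark` (in a Databricks notebook this is predefined). The function `call_api`, the helper name `call_api_in_parallel` and the default numbers are hypothetical, and `call_api` only fakes a response instead of making a real request.

```python
def call_api(item_id: int) -> str:
    """Hypothetical stand-in for one API request; swap in a real HTTP call."""
    return f"response-for-{item_id}"


def call_api_in_parallel(spark, num_items: int = 1000, num_partitions: int = 8):
    """Return a DataFrame with one (simulated) API response per id.

    The UDF is evaluated on the workers, one partition at a time,
    so Spark decides how the calls are spread across the cluster.
    """
    # Imported lazily so the pure-Python helper above can be used without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    call_api_udf = udf(call_api, StringType())

    # spark.range produces a single LongType column named "id".
    ids = spark.range(num_items).repartition(num_partitions)
    return ids.withColumn("response", call_api_udf(col("id")))
```

In a notebook you could then run something like `call_api_in_parallel(spark).show(5)`. The partition count is the knob that controls how many calls can run concurrently, so it should be chosen with the API's rate limits in mind.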

Spark natively distributes DataFrame partitions across the workers, and if there are resources left over, it can process another table or set of partitions at the same time.

Can you describe what you are trying to achieve by using Python libraries for parallelism?