Hi everyone,
I am trying to implement parallel processing in Databricks, and all the resources I find online point to using ThreadPool from Python's multiprocessing.pool module or the concurrent.futures library. As far as I can tell, these approaches create asynchronous threads, not actual parallel processes. I want to know how to use multiprocessing.Pool.map (or multiprocessing.Process) to initialize workers and run truly parallel processes.
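To make the question concrete, here is a minimal sketch of the kind of thing I mean, written in plain Python outside of any Spark API. The names square and run_in_processes and the pool size of 4 are placeholders I made up for illustration. My assumption is that multiprocessing only starts processes on the machine where the code runs (the driver, in a notebook), which is part of what I am asking about.

```python
import multiprocessing as mp
import os


def square(x):
    # Placeholder task: return the result together with the PID of the
    # process that computed it, so the use of separate processes is visible.
    return x * x, os.getpid()


def run_in_processes(values, n_workers=4):
    # Pool starts n_workers separate OS processes on the local machine only;
    # pool.map splits the input across them and returns results in input order.
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    results = run_in_processes(range(8))
    print([value for value, _ in results])      # squared values, in input order
    print({pid for _, pid in results})          # PIDs of the worker processes used
```

If the second print shows PIDs different from the parent's, the work ran in separate processes rather than threads, but nothing in this sketch sends work to the cluster's worker nodes.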
I would also like to know how this works under the hood in Databricks (creating and distributing processes rather than threads), given its driver/worker node architecture.
Finally, is there a reason everyone creates threads rather than actual parallel processes in Databricks?
Thanks.