Databricks Python function achieving Parallelism
11-06-2024 02:40 PM
Hello everyone,
I have a very basic question about Spark parallelism on Databricks.
I have a Python function inside a for loop, so I believe it is running sequentially.
The Databricks cluster has Photon enabled and runs Spark 15x. Does that mean the driver takes care of running this in parallel even though it is in a for loop, or do I need to introduce something to make the function run in parallel?
I need your help to understand this, and if I do need to introduce parallelism explicitly, how do I do it?
Also, how do I size it based on the total executor cores in the cluster? [I read that executor cores are responsible for the parallelism.]
Please correct me if my understanding is wrong.
Thanks
Sathya
- Labels: Spark
11-07-2024 05:06 AM - edited 11-07-2024 05:06 AM
In Spark, the level of parallelism is determined by the number of partitions and the number of executor cores. Each task runs on a single core, so having more executor cores allows more tasks to run in parallel.
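As a rough illustration of that relationship, here is a small back-of-the-envelope sketch in plain Python. It assumes one core per task (the Spark default) and ignores skew and scheduling overhead; the function name `task_waves` and the example numbers are made up for illustration.

```python
import math


def task_waves(num_partitions, total_executor_cores):
    """Estimate how one stage is scheduled, assuming one core per task.

    Returns (tasks that can run at the same time, number of sequential waves).
    """
    # At most one task per core, and never more tasks than partitions.
    concurrent_tasks = min(num_partitions, total_executor_cores)
    # Remaining partitions wait for a free core, so they run in later waves.
    waves = math.ceil(num_partitions / total_executor_cores)
    return concurrent_tasks, waves


# Hypothetical cluster: 4 executors x 4 cores = 16 cores, 200 partitions.
print(task_waves(200, 16))
```

With fewer partitions than cores, some cores simply sit idle, which is why both numbers matter.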
To achieve parallelism, you need to introduce parallel processing explicitly; a plain Python for loop runs its iterations one after another on the driver. There are several ways to do this, such as using Spark's map operations on RDDs or DataFrames, or using Python's standard concurrent.futures module.
1) If your function can be applied to elements of an RDD or DataFrame, you can use Spark's map operation to run it in parallel across the cluster.
11-07-2024 02:20 PM
Thank you for your reply. My code is not writing to any target; it is actually running OPTIMIZE and VACUUM on all the tables in a catalog. Currently the for loop takes one table at a time and performs the actions sequentially.
Can this be parallelized using the concurrent.futures module, or are there other ways of doing it?
Thanks
Sathya
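For the per-table loop described above, one possible approach is a driver-side thread pool from concurrent.futures that submits the maintenance statements for several tables at once. The following is only a minimal sketch under some assumptions: `maintain_table`, `maintain_tables`, and the `max_workers` value are hypothetical names and settings, and `run_sql` is a placeholder for whatever callable executes a SQL statement (in a Databricks notebook that would typically be `spark.sql`).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def maintain_table(run_sql, table_name):
    """Run OPTIMIZE and then VACUUM on one table; return its name when done."""
    run_sql(f"OPTIMIZE {table_name}")
    run_sql(f"VACUUM {table_name}")
    return table_name


def maintain_tables(run_sql, table_names, max_workers=4):
    """Submit one maintenance job per table from driver-side threads.

    Returns (succeeded, failed), where failed maps table name -> exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One future per table; each thread only submits statements and waits.
        futures = {pool.submit(maintain_table, run_sql, t): t for t in table_names}
        for future in as_completed(futures):
            table = futures[future]
            try:
                succeeded.append(future.result())
            except Exception as exc:  # keep going if one table fails
                failed[table] = exc
    return succeeded, failed
```

In a notebook this might be called as `maintain_tables(spark.sql, ["my_catalog.my_schema.table_a", "my_catalog.my_schema.table_b"], max_workers=8)`, where the table names are placeholders. The threads themselves do little work; each one hands its statements to Spark and waits, so the number of workers controls how many tables are processed at the same time and is best tuned against the cluster's total executor cores.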
11-10-2024 09:06 PM
Any help here? Thanks.

