Change Default Parallelism?
06-19-2024 06:59 PM
Hi,
I attempted to parallelize my Spark read by setting the default parallelism with spark.conf.set("spark.default.parallelism", "X"). Despite this setting, sc.defaultParallelism in my notebook still showed 64. Each stage of the job also still has the default 200 tasks. How can I increase this parallelism further, and where does the value 200 come from? The source is a Python list of S3 files in JSON format.
06-19-2024 11:28 PM
sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden from a running notebook with spark.conf.set, which is why you still see 64. The reason you are seeing 200 tasks is spark.sql.shuffle.partitions, whose default value is 200. This setting determines the number of partitions, and therefore tasks, whenever a shuffle is performed, e.g. between stages. Unlike sc.defaultParallelism, it can be changed at runtime. You can set it to sc.defaultParallelism, or a multiple of it, if you want to match the shuffle tasks to the available cores.
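As a rough sketch of that suggestion, the snippet below reads the core-based parallelism from an existing session and writes a matching value to spark.sql.shuffle.partitions. The helper names (shuffle_partitions_for, tune_shuffle_partitions) and the multiplier parameter are made up for illustration, and it assumes a SparkSession is already available as spark, as it normally is in a notebook.

```python
def shuffle_partitions_for(default_parallelism, multiplier=1):
    """Hypothetical helper: pick a shuffle partition count as a
    multiple of the cluster's default parallelism (core count)."""
    if default_parallelism < 1 or multiplier < 1:
        raise ValueError("default_parallelism and multiplier must be >= 1")
    return default_parallelism * multiplier


def tune_shuffle_partitions(spark, multiplier=1):
    """Set spark.sql.shuffle.partitions on an existing SparkSession
    and return the value that was applied."""
    # Core-based value reported by the running SparkContext.
    cores = spark.sparkContext.defaultParallelism
    target = shuffle_partitions_for(cores, multiplier)
    # This is a runtime SQL setting, so it applies to later shuffles.
    spark.conf.set("spark.sql.shuffle.partitions", str(target))
    return target
```

In a notebook you might call tune_shuffle_partitions(spark, multiplier=2) and then check spark.conf.get("spark.sql.shuffle.partitions"). The multiplier is only a starting point for experimentation, since the best value depends on data size and skew.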

