Databricks Community

dat_77 · ‎06-19-2024

HI
I attempted to parallelize my Spark read process by setting the default parallelism using spark.conf.set("spark.default.parallelism", "X"). However, despite setting this configuration, when I checked sc.defaultParallelism in my notebook, it displayed 64. Interestingly, the job still consisting each stage having the default 200 tasks. How can I further increase this parallelism where is the value 200 taking from? The source is a python list of s3 files and json formatted.

irfan_elahi · ‎06-19-2024

sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden. The reason you are seeing 200 tasks is because of spark.sql.shuffle.partitions (whose default value is 200). This determines the number of shuffle partitions when a shuffle is performed e.g. between stages. You can set it equal to sc.defaultParallelism if you want to increase parallelism in the shuffle tasks.

Databricks Community

Change Default Parallelism ?

Connect with Databricks Users in Your Area

Introducing an exclusively Databricks-hosted Assistant

How to present and share your Notebook insights in AI/BI Dashboards

Meet the Databricks MVPs

Now Hiring: Databricks Community Technical Moderator

Insights from a global survey of 1,100 technologists and interviews with 28 CIOs