Databricks Community

dat_77 · ‎06-19-2024

Hi,
I tried to parallelize my Spark read by setting the default parallelism with spark.conf.set("spark.default.parallelism", "X"). Despite this setting, sc.defaultParallelism in my notebook still shows 64, and each stage of the job still has the default 200 tasks. How can I increase this parallelism further, and where does the value 200 come from? The source is a Python list of S3 files in JSON format.

irfan_elahi · ‎06-19-2024

sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden from the notebook. You are seeing 200 tasks because of spark.sql.shuffle.partitions, whose default value is 200. This setting determines the number of partitions used whenever a shuffle is performed, e.g. between stages. If you want to change the parallelism of the shuffle tasks, you can set it equal to sc.defaultParallelism.
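
A minimal sketch of that suggestion, assuming a Databricks notebook where `spark` is already defined. The helper name `align_shuffle_partitions` and its `multiplier` parameter are hypothetical and only here for illustration; the config key and `defaultParallelism` are the ones mentioned above.

```python
def align_shuffle_partitions(spark, multiplier=1):
    """Set spark.sql.shuffle.partitions to a multiple of the default parallelism.

    `spark` is expected to be a SparkSession (predefined in Databricks notebooks).
    `multiplier` is an illustrative knob, not a Spark setting.
    """
    # Default parallelism reported by the SparkContext
    # (based on the worker cores in the cluster).
    cores = spark.sparkContext.defaultParallelism

    # Never go below one partition.
    target = max(1, cores * multiplier)

    # Unlike spark.default.parallelism, this SQL setting can be changed at runtime.
    spark.conf.set("spark.sql.shuffle.partitions", str(target))
    return target


# Usage in a notebook (uncomment where `spark` exists):
# align_shuffle_partitions(spark)
# print(spark.conf.get("spark.sql.shuffle.partitions"))
```

With the 64 cores reported in the question, a multiplier of 1 gives 64 shuffle partitions, which is fewer than the default 200; a larger multiplier may be more appropriate if the goal is more tasks per stage.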

Change Default Parallelism?
