Databricks Community

dat_77 · ‎06-19-2024

Hi,
I attempted to parallelize my Spark read process by setting the default parallelism using spark.conf.set("spark.default.parallelism", "X"). However, despite setting this configuration, when I checked sc.defaultParallelism in my notebook, it displayed 64. Interestingly, each stage of the job still has the default 200 tasks. How can I further increase this parallelism, and where does the value 200 come from? The source is a Python list of S3 files in JSON format.

irfan_elahi · ‎06-19-2024

sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden from a notebook with spark.conf.set once the cluster is running, which is why you still see 64. The 200 tasks come from spark.sql.shuffle.partitions, whose default value is 200. This setting determines how many partitions are produced whenever a shuffle is performed, e.g. between stages. If you want the shuffle tasks to match the cluster's parallelism, you can set it equal to sc.defaultParallelism.
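As a minimal sketch of that suggestion, the helper below reads the cluster's default parallelism and writes it into spark.sql.shuffle.partitions. The function name align_shuffle_partitions and its multiplier parameter are hypothetical, introduced only for illustration; it assumes you pass in the spark session that a Databricks notebook already provides.

```python
def align_shuffle_partitions(spark, multiplier=1):
    """Set spark.sql.shuffle.partitions from the cluster's default parallelism.

    `spark` is assumed to be an active SparkSession (predefined in a
    Databricks notebook). `multiplier` is a hypothetical knob for using
    more than one shuffle partition per core.
    """
    # Default parallelism as reported by the SparkContext (64 in the question)
    cores = spark.sparkContext.defaultParallelism
    target = max(1, cores * multiplier)
    # Unlike spark.default.parallelism, this SQL setting can be changed at runtime
    spark.conf.set("spark.sql.shuffle.partitions", str(target))
    return target
```

In a notebook you would call align_shuffle_partitions(spark) before running the job, then confirm the value with spark.conf.get("spark.sql.shuffle.partitions"). This only affects stages that follow a shuffle; it does not change how many tasks the initial file read uses.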

Change Default Parallelism ?
