Hey @Mits11 — just a heads-up: Community Edition will sunset at the end of the year and will no longer be available after that point. The new home for users is Databricks Free Edition, which is where all future resources and support are being directed.
Community Edition is still accessible for now to give everyone time to migrate their work and assets over to Free Edition. I’d recommend making that move soon so you’re fully set up before the transition.
To answer your questions directly though, here’s what’s happening in both cases and how to verify or control it.
Why you saw 8 partitions for a single 181 MB CSV
- The spark.sql.files.maxPartitionBytes setting is an upper bound (default 128 MB), not a “target” count; Spark may create more than ceil(size/maxPartitionBytes) partitions depending on its file-scan logic and other advisory settings.
- Spark also considers a suggested minimum via spark.sql.files.minPartitionNum (it defaults to the cluster’s spark.default.parallelism). That can push the reader toward “at least” that many partitions on file-based inputs, even if a simple size/128 MB estimate would suggest fewer.
- In practice, with a single ~181 MB CSV and defaults, you can easily see 8 input partitions because the “minimum partitions” advisory aligns with the environment’s default parallelism (more on that below). This is consistent with guidance that large files are split into roughly 128 MB partitions by default, but the actual count can be higher based on min partitions and split merging; the sketch right after this list walks through the arithmetic.
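If you want to see where the 8 comes from, here’s a rough back-of-the-envelope model of how Spark sizes file splits (patterned after Spark’s FilePartition split-size logic); the file size and config values below are assumptions matching your scenario, not something read from your environment:
```python
# Rough model of Spark's file-split sizing (assumed values for a ~181 MB CSV with defaults).
import math

file_bytes = 181 * 1024 * 1024            # ~181 MB CSV (assumed)
max_partition_bytes = 128 * 1024 * 1024   # spark.sql.files.maxPartitionBytes default
open_cost = 4 * 1024 * 1024               # spark.sql.files.openCostInBytes default
min_partition_num = 8                     # suggested minimum; here equal to default parallelism

# Spread (file size + per-file open cost) over the suggested minimum number of partitions...
bytes_per_core = (file_bytes + open_cost) / min_partition_num
# ...then cap each split at maxPartitionBytes, but never go below the open cost.
max_split_bytes = min(max_partition_bytes, max(open_cost, bytes_per_core))

print(round(max_split_bytes / 1024 / 1024, 1))   # ~23.1 MB per split
print(math.ceil(file_bytes / max_split_bytes))   # 8 partitions
```
With the defaults, the suggested minimum wins over the 128 MB cap, so the splits come out around 23 MB each and you land on 8 partitions rather than the 2 that a simple size/128 MB estimate would suggest.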
How to verify in your notebook:
```python
spark.conf.get("spark.sql.files.maxPartitionBytes")           # default "128MB"
spark.conf.get("spark.sql.files.minPartitionNum", "unset")    # no explicit default; falls back to default parallelism
sc.defaultParallelism

df = spark.read.option("header", "true").csv("/path/to/file.csv")
df.rdd.getNumPartitions()   # input partition count after the read
```
If you want exactly 2 partitions post-read:
- Use df.repartition(2) (forces a shuffle, evenly redistributes).
- Or df.coalesce(2) (no shuffle, merges existing partitions; better near the end of a pipeline if distribution is already balanced). A quick sketch of both follows.
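Here’s a minimal sketch (the path and the 2-partition target are just placeholders from your scenario):
```python
# Read, then explicitly control the partition count (path and target count are illustrative).
df = spark.read.option("header", "true").csv("/path/to/file.csv")

df_shuffled = df.repartition(2)   # full shuffle; rows are evenly redistributed across 2 partitions
df_merged = df.coalesce(2)        # no shuffle; existing partitions are merged down to 2

print(df_shuffled.rdd.getNumPartitions(), df_merged.rdd.getNumPartitions())  # 2 2
```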
If you want fewer partitions at read-time (not guaranteed):
- Increase spark.sql.files.maxPartitionBytes (e.g., to 256 MB) and/or lower spark.sql.files.minPartitionNum; just note that minPartitionNum is a suggestion, not a hard guarantee (see the snippet below).
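As a sketch, with illustrative values and the same placeholder path as above; the count you actually get still depends on Spark’s split logic:
```python
# Nudge Spark toward fewer, larger input splits before reading (values are illustrative).
spark.conf.set("spark.sql.files.maxPartitionBytes", "256MB")   # raise the per-split size cap
spark.conf.set("spark.sql.files.minPartitionNum", "2")         # lower the suggested minimum

df = spark.read.option("header", "true").csv("/path/to/file.csv")
print(df.rdd.getNumPartitions())   # likely fewer than 8 now, but not guaranteed to be exactly 2
```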
Why the Spark UI shows “8 cores” on Community Edition
- In local mode, Spark’s “cores” in the UI represent the number of worker threads (task slots), not the physical CPU cores of your machine or VM.
- Spark caps certain thread-related defaults at 8, and while local[*] will use “up to all cores,” the effective concurrency often shows up as 8 threads in the UI; hence the “Total Cores: 8” you’re seeing on CE even though your cluster page lists 2 cores.
- On Databricks Community Edition specifically, you’ll typically observe spark.default.parallelism = 8 in local mode, which aligns with what the Spark UI displays as available task slots, again reflecting threads/concurrency rather than the physical core count.
What to check:
```python
sc.master                    # often local[*] on CE
sc.defaultParallelism        # commonly 8 on CE
spark.sparkContext.uiWebUrl  # URL of the Spark UI for this session
```
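If you want to see the thread-count behavior directly, here’s a minimal local-mode sketch you could run in a plain PySpark environment (on Databricks the session is managed for you, so getOrCreate would just return the existing one); the app name and thread count are placeholders:
```python
# Local-mode sketch: the N in local[N] is the number of worker threads (task slots),
# which is what the Spark UI reports as "cores". Values here are illustrative.
from pyspark.sql import SparkSession

spark_local = (
    SparkSession.builder
    .master("local[2]")             # 2 threads = 2 task slots shown as "cores" in the UI
    .appName("thread-slots-demo")   # placeholder app name
    .getOrCreate()
)
print(spark_local.sparkContext.defaultParallelism)   # 2: parallelism follows the thread count
```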
Quick takeaways
- “8 partitions” on read is normal with the defaults (128 MB max per partition plus a suggested minimum aligned to default parallelism). If you need a specific partition count, set it explicitly with repartition/coalesce after reading.
- “8 cores” in the Spark UI on CE reflects Spark’s thread-based parallelism in local mode, not physical cores; the UI shows task slots/threads, and Spark may cap defaults at 8 for concurrency.
Hope this helps, Louis.