Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Understanding Partitions in Spark Local Mode

Personal1
New Contributor II

I have a few fundamental questions about Spark 3 that came up while running a simple Spark app on my local Mac (6 cores in total). Please help.

  1. local[*] runs my Spark application in local mode with all the cores present on my Mac, correct? Does it also mean that the dataframe will have as many partitions as the number of cores available to the master?
  2. When I run with local[*], I get 12 as the defaultParallelism for SparkContext. When I run with a specific number, such as local[2] or local[4], I get that same number as the defaultParallelism. Why is that? Does Spark calculate defaultParallelism as total number of cores * 2 when it sees local[*] as the master, and otherwise keep it equal to the number I gave?
  3. With or without Adaptive Query Execution enabled, I see 12 partitions in my dataframe when the master is local[*], whereas I see 7 when the master is local[4]. I thought AQE decides the correct number of partitions in Spark 3. Is that not right? Why does it pick 7 as the number of partitions when the master is local[4]?
  4. I am able to print the Spark version and sc.master even after stopping spark or sc. Why? Shouldn't I get an error that the connection is no longer active?

Below is the code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
print("Spark Version:", spark.version)

sc = spark.sparkContext
print('Master :', sc.master)
print('Default Parallelism :', sc.defaultParallelism)

print('AQE Enabled :', spark.conf.get('spark.sql.adaptive.enabled'))
spark.conf.set('spark.sql.adaptive.enabled', 'true')
print('AQE Enabled :', spark.conf.get('spark.sql.adaptive.enabled'))

df = spark.read.load("/Users/user/xyz.csv",
                     format="csv", sep=",", inferSchema="true", header="true")
print('No of partitions in the dataframe :', df.rdd.getNumPartitions())
print(sc.uiWebUrl)

spark.stop()

print("Spark Version:", spark.version)
print('Master :', sc.master)

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

That is a lot of questions in one topic.

Let's give it a try:

[1] Yes, local[*] uses all the cores the machine reports. The number of partitions, however, depends on the values of the relevant configuration parameters and on the program you run (think joins, unions, repartition, etc.).

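[2] In local mode, spark.default.parallelism defaults to the number of cores given to the master. local[*] uses every logical core the JVM reports, which on a 6-core machine with hyper-threading is 12 (hence what looks like cores * 2), while local[N] gives exactly N.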

[3] Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan.

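AQE does not just decide the number of partitions, and it only acts at shuffle boundaries. The partition count of a dataframe straight after a file read comes from the file-split settings (such as spark.sql.files.maxPartitionBytes) together with the default parallelism, which is why it changes with the master setting whether or not AQE is enabled.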

https://spark.apache.org/docs/latest/sql-performance-tuning.html

[4] no idea, perhaps it is buffered/cached somewhere
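
To make points [2] and [3] concrete, here is a rough pure-Python sketch of how the numbers could come about. It is not Spark's code: the function names are made up for illustration, the 800 MB file size is a hypothetical value (the question does not state the real size), and the 12 logical cores are passed in by hand. The 128 MB and 4 MB values are the documented defaults of spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes. The sketch only covers a single splittable file and ignores how Spark packs many small files together.

```python
import math
import os
import re

MB = 1024 * 1024


def local_parallelism(master, logical_cores=None):
    """Thread count implied by a local master URL: local, local[N] or local[*]."""
    if logical_cores is None:
        # Assumption: stand-in for the logical core count the JVM would report.
        logical_cores = os.cpu_count() or 1
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a simple local master URL: {master!r}")
    threads = match.group(1)
    return logical_cores if threads == "*" else int(threads)


def estimate_read_partitions(file_bytes, parallelism,
                             max_partition_bytes=128 * MB,
                             open_cost_bytes=4 * MB):
    """Rough partition count for one splittable file read by Spark SQL."""
    # Spread the file (plus a per-file open cost) over the available cores.
    bytes_per_core = (file_bytes + open_cost_bytes) // parallelism
    # Clamp the split size between the open cost and the max partition size.
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return math.ceil(file_bytes / max_split)


# Hypothetical 800 MB CSV on a machine that reports 12 logical cores.
size = 800 * MB
for master in ("local[4]", "local[*]"):
    p = local_parallelism(master, logical_cores=12)
    print(master, p, estimate_read_partitions(size, p))
# local[4] 4 7
# local[*] 12 12
```

With a file of roughly that size, local[4] hits the 128 MB cap and ends up with 7 splits, while local[*] with 12 cores gets splits of about 67 MB and ends up with 12, which would match the numbers in the question. AQE plays no part in this calculation.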


3 REPLIES

Anonymous
Not applicable

Hello there,

My name is Piper and I'm one of the community moderators. Thank you for your questions. They look like good ones! Let's see how the community responds first and then we'll see if we need the team to follow up.

Kaniz_Fatma
Community Manager

Hi @Abhishek Pradhan, just a friendly follow-up. Do you still need help, or did @Werner Stinckens's response help you find the solution? Please let us know.
