I have a few fundamental questions about Spark 3 that came up while running a simple Spark app on my local Mac (6 cores in total). Please help.
- local[*] runs my Spark application in local mode using all the cores on my Mac, correct? Does it also mean the dataframe will have as many partitions as the number of cores available to the master?
- When I run with local[*], defaultParallelism on the SparkContext is 12. When I run with an explicit count such as local[2] or local[4], defaultParallelism equals the number I passed in. Why is that? Does Spark compute defaultParallelism as total cores * 2 when the master is local[*], and otherwise keep it equal to my input? (See the comparison sketch right after this list.)
- With or without Adaptive Query Execution enabled, the dataframe has 12 partitions when the master is local[*] but 7 when it is local[4]. I thought AQE was supposed to decide the right number of partitions in Spark 3. Is that not the case? Why does it end up with 7 partitions when the master is local[4]?
- I can still print the Spark version and sc.master even after stopping spark or sc. Why? Shouldn't I get an error saying the connection is no longer active? (See the check at the end of the post.)
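For context, this is the kind of comparison I have in mind for the first two bullets. It is only a rough sketch (same sample CSV as below, the appName is arbitrary, and each local master is tried in its own iteration), not something I rely on beyond printing the numbers:

from pyspark.sql import SparkSession

for master in ["local[2]", "local[4]", "local[*]"]:
    # Each iteration starts a fresh local session and stops it again at the end.
    spark = SparkSession.builder.master(master).appName("parallelism-check").getOrCreate()
    sc = spark.sparkContext
    df = spark.read.load("/Users/user/xyz.csv",
                         format="csv", sep=",", inferSchema="true", header="true")
    print(master,
          "-> defaultParallelism:", sc.defaultParallelism,
          ", dataframe partitions:", df.rdd.getNumPartitions())
    spark.stop()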
Below is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
print("Spark Version:",spark.version)
sc = spark.sparkContext
print('Master :',sc.master)
print('Default Parallelism :',sc.defaultParallelism)
print('AQE Enabled :',spark.conf.get('spark.sql.adaptive.enabled'))
spark.conf.set('spark.sql.adaptive.enabled','true')
print('AQE Enabled :',spark.conf.get('spark.sql.adaptive.enabled'))
df = spark.read.load("/Users/user/xyz.csv",
                     format="csv", sep=",", inferSchema="true", header="true")
print('No of partitions in the dataframe :',df.rdd.getNumPartitions())
print(sc.uiWebUrl)
spark.stop()
print("Spark Version:",spark.version)
print('Master :',sc.master)
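And for the last bullet, this is the extra check I have in mind right after the two prints above: attributes like spark.version and sc.master still print fine, but actually running a job on the stopped context is what I would expect to fail (minimal sketch):

try:
    # An actual job on the stopped SparkContext, unlike the attribute prints above.
    spark.range(5).count()
except Exception as e:
    print("Job after stop() failed:", e)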