I have a few fundamental questions about Spark 3 that came up while running a simple Spark app on my local Mac (6 cores in total). Please help.
- local[*] runs my Spark application in local mode using all the cores on my Mac, correct? Does it also mean the dataframe will have as many partitions as the number of cores available to the master?
- When I run with local[*], defaultParallelism on the SparkContext is 12. When I run with an explicit count such as local[2] or local[4], defaultParallelism equals the number I passed in. Why is that? Does Spark compute defaultParallelism as total cores * 2 when the master is local[*], and otherwise keep it equal to my input? (See the comparison sketch right after this list.)
- With or without Adaptive Query Execution enabled, the dataframe has 12 partitions when the master is local[*] but 7 when it is local[4]. I thought AQE was supposed to decide the right number of partitions in Spark 3. Is that not the case? Why does it end up with 7 partitions when the master is local[4]?
- I can still print the Spark version and sc.master even after stopping spark or sc. Why? Shouldn't I get an error saying the connection is no longer active? (See the check at the end of the post.)
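For context, this is the kind of comparison I have in mind for the first two bullets. It is only a rough sketch (same sample CSV as below, the appName is arbitrary, and each local master is tried in its own iteration), not something I rely on beyond printing the numbers:

from pyspark.sql import SparkSession

for master in ["local[2]", "local[4]", "local[*]"]:
    # Each iteration starts a fresh local session and stops it again at the end.
    spark = SparkSession.builder.master(master).appName("parallelism-check").getOrCreate()
    sc = spark.sparkContext
    df = spark.read.load("/Users/user/xyz.csv",
                         format="csv", sep=",", inferSchema="true", header="true")
    print(master,
          "-> defaultParallelism:", sc.defaultParallelism,
          ", dataframe partitions:", df.rdd.getNumPartitions())
    spark.stop()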
Below is the code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
print("Spark Version:",spark.version)
sc = spark.sparkContext
print('Master :',sc.master)
print('Default Parallelism :',sc.defaultParallelism)
print('AQE Enabled :',spark.conf.get('spark.sql.adaptive.enabled'))
spark.conf.set('spark.sql.adaptive.enabled','true')
print('AQE Enabled :',spark.conf.get('spark.sql.adaptive.enabled'))
df = spark.read.load("/Users/user/xyz.csv",
                     format="csv", sep=",", inferSchema="true", header="true")
print('No of partitions in the dataframe :',df.rdd.getNumPartitions())
print(sc.uiWebUrl)
spark.stop()
print("Spark Version:",spark.version)
print('Master :',sc.master)
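And for the last bullet, this is the extra check I have in mind right after the two prints above: attributes like spark.version and sc.master still print fine, but actually running a job on the stopped context is what I would expect to fail (minimal sketch):

try:
    # An actual job on the stopped SparkContext, unlike the attribute prints above.
    spark.range(5).count()
except Exception as e:
    print("Job after stop() failed:", e)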