Hi,
I am using PySpark to read a bunch of Parquet files and run a count on each of them. Driver memory shoots up to about 6-8 GB.
My setup:
I have a cluster with 1 driver node and 2 worker nodes (each with 16 cores and 128 GB RAM). Below is a simplified version of my problem.
# `spark` is the active SparkSession
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # I have about 30 such tables.

for tbl in tables:
    df = spark.read.parquet(tbl)  # read each Parquet table
    df.cache()                    # cache it before counting
    print(df.count())             # trigger the read and print the row count
Out of the ~30 tables I load, two have about 20 million rows; the rest are all small.
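For reference, this is the variation I was planning to try next, releasing each cached DataFrame once its count is done. The unpersist call is just my guess at a mitigation, not something I have verified helps with the driver memory:

for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(df.count())
    df.unpersist()  # release the cached data before moving to the next table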
Is there any reason why my driver memory goes up like this?
Thanks
Ramz