High driver memory usage on loading parquet file
03-07-2023 12:40 AM
Hi,
I am using PySpark to read a bunch of parquet files and run a count on each of them. Driver memory shoots up by about 6 GB to 8 GB.
My setup:
I have a cluster of 1 driver node and 2 worker nodes (each with 16 cores and 128 GB RAM). This is a simplified version of my problem.
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # I have about 30 such tables.
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(df.count())
Out of the 30 tables I load, two have about 20 million rows; the rest are all small.
Is there any reason why my driver memory goes up?
Thanks
Ramz
Labels: Parquet File, Parquet files
03-08-2023 10:32 PM
Hi,
Could you please confirm the approximate data size being processed here, as well as the DBR version and the cluster config?
Also, you can refer to https://docs.databricks.com/clusters/cluster-config-best-practices.html for cluster configuration best practices to get the best performance from your setup.
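For reference, here is a minimal sketch of one way to gather those details from the attached notebook (this assumes a Databricks notebook where spark and dbutils are already defined; the clusterUsageTags key is an assumption and may vary by DBR version, and the size estimate only walks one directory level, so it is approximate for partitioned tables):
print(spark.version)  # Spark version of the cluster
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))  # DBR version string (assumed config key)

# Rough on-disk size of one table's parquet files (single directory level only)
total_bytes = sum(f.size for f in dbutils.fs.ls('/mnt/a'))
print(round(total_bytes / 1024 ** 3, 2), "GB")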
Please let us know if this helps.
Also, please tag @Debayan in your next response, which will notify me. Thank you!
03-12-2023 03:32 AM
Hi @Debayan Mukherjee ,
The amount of data being processed is about 80 GB (all the tables combined), and that much memory is available on the worker nodes. My concern is why the driver memory is increasing. My understanding is that the driver should not load any data at all. If no data is being loaded on the driver, why is there a jump in its memory?
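(For reference, the driver JVM heap can be sampled from the notebook before and after the loop to see where the jump happens; this is a minimal sketch that goes through the private spark.sparkContext._jvm py4j handle, so treat it as a debugging aid only and the numbers as approximate:)
rt = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
used_mb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
print("Driver JVM heap in use: %.0f MB" % used_mb)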
Thanks
Ramz
03-12-2023 11:06 PM
Hi, the driver is responsible for coordinating the workloads. The driver node maintains state information for all notebooks attached to the cluster, maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors. Hence the memory usage can depend on all of this driver-side state.
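One piece of that driver-side state is the block metadata tracked for cached data: every df.cache() in the loop stays registered until it is explicitly released or evicted. Below is a minimal sketch of the loop from the original post, using the standard unpersist() API to release each table after its count instead of keeping all ~30 cached at once (whether this removes the memory jump in your case is something you would need to verify):
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # about 30 such tables
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(tbl, df.count())
    df.unpersist(blocking=True)  # drop the cached blocks and the metadata the driver tracks for them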
03-31-2023 05:57 PM
Hi @ramz siva
Thank you for your question! To assist you better, please take a moment to review the answer and let me know whether it fits your needs.
If it does, please help us select the best solution by clicking "Select As Best".
Your feedback will help us ensure that we are providing the best possible service to you. Thank you!