Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

High driver memory usage on loading parquet file

ramz
New Contributor II

Hi,

I am using PySpark to read a set of Parquet files and run a count on each of them. Driver memory shoots up to about 6G to 8G.

My setup:

I have a cluster of 1 driver node and 2 worker nodes (each with 16 cores and 128 GB RAM). This is a simplified version of my problem:

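# `spark` is the SparkSession that the notebook provides.
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # I have about 30 such tables.
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()          # lazy: only marks the DataFrame for caching
    print(df.count())   # the action that reads the files and fills the cache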

Of the 30 tables I load, two have 20 million rows; the rest are all small.

Is there any reason why my driver memory goes up?

Thanks

Ramz

4 REPLIES

Debayan
Esteemed Contributor III

Hi,

Could you please confirm the approximate size of the data being processed, the DBR version, and the cluster config?

Also, you can refer to https://docs.databricks.com/clusters/cluster-config-best-practices.html for cluster configuration best practices that can help you tune performance for your setup.

Please let us know if this helps. 

Also, please tag @Debayan in your next response so that I am notified. Thank you!

ramz
New Contributor II

Hi @Debayan Mukherjee,

The amount of data being processed is about 80 GB (all the tables combined). That much memory is available on the worker nodes. My concern is why driver memory is increasing. My understanding is that the driver should not load any data at all. If no data is being loaded on the driver, why is there a jump in its memory?

Thanks

Ramz

Debayan
Esteemed Contributor III

Hi, the driver is responsible for running the workloads. The driver node maintains state information for all notebooks attached to the cluster. It also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors. So driver memory can grow even when the data itself stays on the workers, and how much it grows depends on the workload.
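
One thing that might be worth trying, given that the loop in the question caches every DataFrame and never releases it: a minimal sketch that counts each table and unpersists it before moving on. The helper name `count_tables` and its `cache` flag are hypothetical, not part of any Spark API; the only Spark calls assumed are `spark.read.parquet`, `DataFrame.cache`, `DataFrame.count`, and `DataFrame.unpersist`. Whether this lowers driver memory in this particular case is not confirmed anywhere in the thread.

```python
from typing import Dict, Iterable


def count_tables(spark, paths: Iterable[str], cache: bool = False) -> Dict[str, int]:
    """Return {path: row count} for each Parquet path.

    `spark` is assumed to be the SparkSession a notebook provides.
    `cache` is a hypothetical flag: leave it False for a one-off count,
    since caching only pays off when the DataFrame is reused.
    """
    counts: Dict[str, int] = {}
    for path in paths:
        df = spark.read.parquet(path)
        if cache:
            df.cache()  # lazy; the count below materializes the cache
        try:
            counts[path] = df.count()
        finally:
            if cache:
                df.unpersist()  # release this table before loading the next
    return counts
```

In a notebook where `spark` is already defined, `count_tables(spark, ['/mnt/a', '/mnt/b', '/mnt/c'])` would return a dict of row counts keyed by path. Skipping the cache for a plain count avoids keeping all 30 tables registered at once; if the DataFrames are needed later, caching only the ones that are actually reused is the more conservative choice.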

Anonymous
Not applicable

Hi @ramz siva​ 

Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.

Please help us select the best solution by clicking on "Select As Best" if it does.

Your feedback will help us ensure that we are providing the best possible service to you. Thank you!
