High driver memory usage on loading parquet file

ramz
New Contributor II

Hi,

I am using PySpark to read a bunch of parquet files and do a count on each of them. Driver memory shoots up by about 6 GB to 8 GB.

My setup:

I have a cluster of 1 driver node and 2 worker nodes (all of them 16 cores, 128 GB RAM). This is a simplified version of my problem.

tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # I have about 30 such tables.
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(df.count())

Out of the 30 tables I load, two have about 20 million rows; the rest are all small.

Is there any reason why my driver memory goes up?

Thanks

Ramz

4 REPLIES

Debayan
Esteemed Contributor III

Hi,

Could you please confirm the approximate size of the data being processed here, the DBR version, and the cluster config?

Also, you can refer to https://docs.databricks.com/clusters/cluster-config-best-practices.html for cluster configuration best practices to get the best performance out of your setup.
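
For reference, a quick way to check what memory the cluster was actually started with (a minimal sketch; which keys are set explicitly depends on your cluster config):

# Inspect the memory-related settings the driver and executors were launched with.
conf = spark.sparkContext.getConf()
print("spark.driver.memory:  ", conf.get("spark.driver.memory", "not set"))
print("spark.executor.memory:", conf.get("spark.executor.memory", "not set"))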

Please let us know if this helps. 

Also, please tag @Debayan in your next response, which will notify me. Thank you!

ramz
New Contributor II

Hi @Debayan Mukherjee,

The amount of data being processed is about 80 GB (all the tables combined). That much memory is available on the worker nodes. My concern is why the driver memory is increasing. My understanding is that the driver should not load any data at all. If no data is being loaded, why is there a jump in memory?
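
For what it's worth, here is a rough way to watch the driver JVM heap between loads (a sketch that goes through PySpark's internal _jvm gateway, so treat the numbers as indicative only):

# Rough driver-heap check via the JVM Runtime, using py4j's internal handle.
runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
used_mb = (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
print(f"Driver JVM heap in use: {used_mb:.0f} MB")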

Thanks

Ramz

Debayan
Esteemed Contributor III

Hi, the driver is responsible for running the workloads. The driver node maintains state information for all notebooks attached to the cluster. It also maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors. So some driver memory growth can be expected; how much depends on the workload.
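
If the cached tables are only needed for the count, one thing worth trying (a sketch based on the loop from the original question, not a confirmed fix) is to release each DataFrame once it has been counted, so cached blocks and their tracking do not accumulate across all 30 tables:

tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # about 30 such tables
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(df.count())
    # Drop the cached blocks (and their bookkeeping) once the count is done,
    # instead of keeping all 30 tables cached at the same time.
    df.unpersist()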

Anonymous
Not applicable

Hi @ramz siva,

Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.

Please help us select the best solution by clicking on "Select As Best" if it does.

Your feedback will help us ensure that we are providing the best possible service to you. Thank you!
