High driver memory usage on loading parquet file
03-07-2023 12:40 AM
Hi,
I am using PySpark to read a bunch of parquet files and run a count on each of them. Driver memory shoots up by about 6 GB to 8 GB.
My setup:
I have a cluster of 1 driver node and 2 worker nodes (each with 16 cores and 128 GB RAM). This is a simplified version of my problem.
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # I have about 30 such tables.
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(df.count())
Out of the 30 tables I load, two have about 20 million rows; the rest are all small.
Is there any reason why my driver memory goes up?
Thanks
Ramz
Labels: Parquet File, Parquet files
03-08-2023 10:32 PM
Hi,
Could you please confirm the approximate data size being processed here, as well as the DBR version and the cluster config?
Also, you can refer to https://docs.databricks.com/clusters/cluster-config-best-practices.html for cluster configuration best practices to get the best performance from your setup.
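For reference, here is a minimal sketch of one way to gather those details from the attached notebook (this assumes a Databricks notebook where spark and dbutils are already defined; the clusterUsageTags key is an assumption and may vary by DBR version, and the size estimate only walks one directory level, so it is approximate for partitioned tables):
print(spark.version)  # Spark version of the cluster
print(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion"))  # DBR version string (assumed config key)

# Rough on-disk size of one table's parquet files (single directory level only)
total_bytes = sum(f.size for f in dbutils.fs.ls('/mnt/a'))
print(round(total_bytes / 1024 ** 3, 2), "GB")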
Please let us know if this helps.
Also, please tag @Debayan in your next response, which will notify me. Thank you!
03-12-2023 03:32 AM
Hi @Debayan Mukherjee ,
The amount of data being processed is about 80 GB (all the tables combined), and that much memory is available on the worker nodes. My concern is why the driver memory is increasing. My understanding is that the driver should not load any data at all. If no data is being loaded on the driver, why is there a jump in its memory?
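(For reference, the driver JVM heap can be sampled from the notebook before and after the loop to see where the jump happens; this is a minimal sketch that goes through the private spark.sparkContext._jvm py4j handle, so treat it as a debugging aid only and the numbers as approximate:)
rt = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
used_mb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
print("Driver JVM heap in use: %.0f MB" % used_mb)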
Thanks
Ramz
03-12-2023 11:06 PM
Hi, the driver is responsible for coordinating the workloads. The driver node maintains state information for all notebooks attached to the cluster, maintains the SparkContext, interprets all the commands you run from a notebook or a library on the cluster, and runs the Apache Spark master that coordinates with the Spark executors. Hence the memory usage can depend on all of this driver-side state.
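One piece of that driver-side state is the block metadata tracked for cached data: every df.cache() in the loop stays registered until it is explicitly released or evicted. Below is a minimal sketch of the loop from the original post, using the standard unpersist() API to release each table after its count instead of keeping all ~30 cached at once (whether this removes the memory jump in your case is something you would need to verify):
tables = ['/mnt/a', '/mnt/b', '/mnt/c']  # about 30 such tables
for tbl in tables:
    df = spark.read.parquet(tbl)
    df.cache()
    print(tbl, df.count())
    df.unpersist(blocking=True)  # drop the cached blocks and the metadata the driver tracks for them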
03-31-2023 05:57 PM
Hi @ramz siva
Thank you for your question! To assist you better, please take a moment to review the answer and let me know whether it fits your needs.
If it does, please help us select the best solution by clicking "Select As Best".
Your feedback will help us ensure that we are providing the best possible service to you. Thank you!