Why can Pandas on Spark trigger `Driver is up but is not responsive, likely due to GC`?
05-24-2024 12:05 PM - edited 05-24-2024 12:08 PM
I am using distributed Pandas on Spark, not single-node Pandas.
When I try to run the following code to transform a DataFrame with 652 x 729803 data points:
`df_ps_pct = df.pandas_api().pct_change().to_spark()`
it returns this error: `Driver is up but is not responsive, likely due to GC`.
I have already followed the guide "Spark job fails with Driver is temporarily unavailable - Databricks" to stop using single-node Pandas.
My ultimate goal is to calculate `pct_change()` on the Spark DataFrame. However, since Spark does not have `pct_change()`, I convert the Spark DataFrame to Pandas on Spark first and then convert it back to Spark.
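For reference, the hand-rolled PySpark fallback I am trying to avoid would look roughly like this sketch; the ordering column `ts` and value column `price` are placeholders for my real columns:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Rough sketch of pct_change() via a window function.
# "ts" (ordering column) and "price" (value column) are placeholders.
# pct_change = (current - previous) / previous
w = Window.orderBy("ts")  # no partitionBy: Spark warns this pulls all rows into one partition
prev = F.lag("price").over(w)  # previous row's value; null on the first row, like pct_change's NaN

df_pct = df.withColumn("price_pct_change", (F.col("price") - prev) / prev)
```

Applied across many columns this gets verbose, which is why the Pandas-on-Spark one-liner looked attractive.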
05-27-2024 07:50 AM
Hi @Dicer, how are you?
This might be happening due to difficulty handling the garbage collection process, which points to a memory bottleneck on the driver. Have you tried increasing your driver memory size?
This article might also be helpful: https://kb.databricks.com/en_US/jobs/driver-unavailable
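On Databricks the usual lever is simply choosing a larger driver node type in the cluster settings. If you ever run this on self-managed Spark, a minimal sketch of raising driver memory at startup (the 16g value is only an example):

```python
from pyspark.sql import SparkSession

# Minimal sketch for self-managed Spark: spark.driver.memory must be set
# before the driver JVM starts, so pass it at session creation (or via
# spark-submit --driver-memory). The 16g figure is an arbitrary example;
# size it to your workload.
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "16g")
    .getOrCreate()
)
```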
Best,
Alessandro
05-27-2024 06:27 PM
Hi @anardinelli
Thank you for your reply. This is indeed one of the solutions.
However, my driver's memory usage is not constant. Only when it runs certain operations does it trigger the error above: `Driver is up but is not responsive, likely due to GC`.
My expectation is not to run all the code operations on just one single node.
If I increase the driver memory size, do I need to increase the worker memory size as well?
Looking forward to your reply! Thank you once again!
05-28-2024 07:16 AM
Hi @Dicer,
I don't think you have a problem with the workers: since you are running distributed Pandas, the work is parallelized either way. It is when the data is collected back to the driver that it can become overloaded, since the driver has to gather everything.
If you can, please also share the full stack trace of the error message so I can take a closer look.
Increasing only the driver memory should be enough.
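As a side note, the driver only becomes the bottleneck when results are actually pulled back to it. A minimal sketch of the difference, reusing `df_ps_pct` from your post (the output path is a placeholder):

```python
# Pulling the full result to the driver materializes everything in driver
# memory; this is the usual trigger for the GC pauses you are seeing:
pdf = df_ps_pct.toPandas()

# Writing the result out keeps the work on the executors; the driver only
# coordinates the job (the path below is a placeholder):
df_ps_pct.write.mode("overwrite").parquet("/tmp/df_ps_pct")
```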
Best,
Alessandro

