Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Why can Pandas on Spark trigger `Driver is up but is not responsive, likely due to GC`?

Dicer
Valued Contributor

I am using distributed Pandas on Spark, not single-node Pandas.

But when I try to run the following code to transform a DataFrame with 652 × 729,803 data points:

df_ps_pct = df.pandas_api().pct_change().to_spark()

it returns this error: `Driver is up but is not responsive, likely due to GC`.

I have already followed this guide, Spark job fails with Driver is temporarily unavailable - Databricks, to stop using single-node Pandas.

My ultimate goal is to calculate the `pct_change()` on the Spark DataFrame.

However, since Spark does not have `pct_change()`, I convert the Spark DataFrame to Pandas on Spark first and then convert it back to Spark.
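One possible way to avoid the Pandas-on-Spark round trip entirely (a sketch, not tested against the poster's data) is to compute the percentage change with Spark window functions, e.g. `(col - F.lag(col).over(w)) / F.lag(col).over(w)` with `w = Window.orderBy(<ordering column>)`. The helper below shows, in plain Python, the formula that such a window expression would compute per column; the ordering-column name in the comment is illustrative:

```python
def pct_change(values):
    """Plain-Python sketch of pandas' pct_change for one column:
    out[t] = (x[t] - x[t-1]) / x[t-1], with no value for the first row.

    In PySpark this corresponds roughly to:
        w = Window.orderBy("ts")  # "ts" is a hypothetical ordering column
        (F.col(c) - F.lag(c).over(w)) / F.lag(c).over(w)
    """
    out = [None]  # first row has no previous value
    for prev, curr in zip(values, values[1:]):
        out.append((curr - prev) / prev)
    return out

print(pct_change([100.0, 110.0, 99.0]))  # [None, 0.1, -0.1]
```

Because `lag` is evaluated by the executors, this keeps the computation distributed instead of funnelling it through the driver.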

3 REPLIES

anardinelli
New Contributor III

Hi @Dicer, how are you?

This might be happening due to difficulties handling the Garbage Collection process, which points to a memory bottleneck. Have you tried increasing your driver memory size?

This article might also be helpful: https://kb.databricks.com/en_US/jobs/driver-unavailable

Best,

Alessandro

Dicer
Valued Contributor

Hi @anardinelli 

Thank you for your reply. This is indeed one of the solutions.

However, my driver's memory usage is not constant. Only certain code operations trigger the error above: `Driver is up but is not responsive, likely due to GC`.

My expectation is that the code operations should not all run on a single node.

If I increase the driver memory size, do I need to increase the worker memory size as well?

Looking forward to your reply! Thank you once again!

anardinelli
New Contributor III

Hi @Dicer

I don't think the workers are the problem: since you are running distributed Pandas, the work is parallelized either way. The driver, however, can become overloaded when the results are collected back to it (since the driver has to gather everything).

If you can, please share also the stack trace of the entire error message so I can better take a look.

Increasing only the Driver memory should be enough.
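For reference, on a plain (non-Databricks) Spark deployment the driver memory would be raised via configuration like the fragment below; on Databricks itself, driver memory is instead determined by the driver node type chosen in the cluster configuration. The values shown are illustrative only:

```
# spark-defaults.conf (illustrative sizes; tune to your workload)
spark.driver.memory          16g
spark.driver.maxResultSize   8g
```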

Best,

Alessandro
