Re: Cluster Memory Issue (Termination)

egndz · ‎06-06-2024

Hello,

Thank you for your answer. It seems to me that the reply is GPT answer. I would expect an answer from community as a person as I have tried to solve the issue with GPT already.

Nevertheless:

1) Initial Memory Allocation: Adjusting memory configuration might be a solution but my question here is that how I can do that, based on what metrics? What is the technical explanation of the issue and solution?

2) Memory Consumption with Dataframes: I am training a ML model with Logistic Regression and LightGBM with Optuna. PySpark does not provide the configuration of these ML models and hyperparam optimization so I must do toPandas() conversion and use scikit-learn and lightgbm libraries.

3) GC (Allocation Failure): Could you please provide a documentation, blog, book or any feature implementation regarding all of these so I can understand the underlying issue here?

After talking with Databricks Core Team, firstly, I was told that problem is not memory but networking issue:

"The network issue had caused the driver's IP to be out of reach, and hence, the Chauffeur assumed that the driver was dead, marked it as dead and restarted a new driver PID. Since a driver was restarted, the job failed and it should be temporary."

The problem is not temporary and it happens in irregular intervals.

For LightGBM training these are the parameters I am trying with Optuna:

I have seen that playing with n_jobs=1 or n_jobs=5 helped me to reduce the rate of error happening in my trials. However, I have observed that when n_jobs=1, jobs with smalller dataset(~150MB) finish faster compared n_jobs=5 where cross validation should be parallel and faster, which is an unexpected case. When I set n_jobs more than 1, seeing the error chance incrases.

I believe the error is coming from the threading with Optuna and LightGBM (same happens in the Logreg) now. I wonder somehow Optuna(3.5.0), lightgbm(4.3.0) and joblib(1.2.0) libraries creating the problem in the runtime. I am still keep seeing the GC during the runs as I expect them to see because I am using Optuna.study.optimize function with

gc_after_trial=True .

I would literally appreciate a lot from the community if someone has an answer for this. I am willing to have a meeting and talk with anyone at this point.

Thanks!