06-06-2024 03:49 AM
Hello,
Thank you for your answer. It seems to me that the reply was generated by GPT. I would have expected an answer from a person in the community, as I have already tried to solve the issue with GPT.
Nevertheless:
1) Initial Memory Allocation: Adjusting the memory configuration might be a solution, but how should I do that, and based on which metrics? What is the technical explanation of the issue and of the solution?
2) Memory Consumption with Dataframes: I am training ML models with Logistic Regression and LightGBM, tuned with Optuna. PySpark does not provide the configuration of these ML models or of the hyperparameter optimization, so I must do the toPandas() conversion and use the scikit-learn and lightgbm libraries.
3) GC (Allocation Failure): Could you please point me to documentation, a blog, a book, or a feature implementation covering all of this, so I can understand the underlying issue?
After talking with the Databricks Core Team, I was first told that the problem is not memory but a networking issue:
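Regarding the "based on which metrics" question in point 1, one number I can collect myself is the peak Python-side allocation around the toPandas() call and the training step. Below is a minimal sketch using the standard-library tracemalloc module (Python 3.9+ for reset_peak). The helper name track_peak and the stand-in workload are hypothetical, not something from my notebook, and tracemalloc only sees allocations made through Python's allocators, so JVM heap and some native buffers are not counted.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_peak(label, results):
    """Store the peak Python allocation (bytes) of the wrapped block in results[label]."""
    was_tracing = tracemalloc.is_tracing()
    if not was_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # measure the peak of this block only
    before, _ = tracemalloc.get_traced_memory()
    try:
        yield
    finally:
        _, peak = tracemalloc.get_traced_memory()
        results[label] = peak - before
        if not was_tracing:
            tracemalloc.stop()


peaks = {}
# Stand-in workload (hypothetical); in the real job this would wrap
# the toPandas() conversion or a single training trial.
with track_peak("stand_in", peaks):
    data = [bytes(1024) for _ in range(1000)]  # roughly 1 MB of payload

print(peaks)
```

Comparing such per-step peaks against the driver memory might at least show whether the Python process is anywhere near the limit when the driver gets restarted.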
"The network issue had caused the driver's IP to be out of reach, and hence, the Chauffeur assumed that the driver was dead, marked it as dead and restarted a new driver PID. Since a driver was restarted, the job failed and it should be temporary."
However, the problem is not temporary; it happens at irregular intervals.
For LightGBM training these are the parameters I am trying with Optuna:
I have seen that adjusting n_jobs (1 versus 5) helped reduce how often the error occurs in my trials. However, I observed that with n_jobs=1, jobs on a smaller dataset (~150MB) finish faster than with n_jobs=5, where cross-validation should run in parallel and be faster, which is unexpected. When I set n_jobs to more than 1, the chance of seeing the error increases.
I now believe the error comes from threading in Optuna and LightGBM (the same happens with Logistic Regression). I wonder whether the Optuna (3.5.0), lightgbm (4.3.0) and joblib (1.2.0) libraries are somehow causing the problem at runtime. I still keep seeing the GC messages during the runs, which I expect because I am using the Optuna.study.optimize function with
I would really appreciate it if someone from the community has an answer for this. I am willing to have a meeting and talk with anyone at this point.
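To make the n_jobs observation concrete, here is a rough sketch of how nested parallelism could multiply the number of threads. It assumes, and this is only an assumption, that parallelism is set at up to three layers (Optuna trials, cross-validation folds, and the model's own threads) and that each layer spawns its workers independently. The function and parameter names are hypothetical, the core count is illustrative, and libraries such as joblib may cap nested workers, so the result is only an upper bound.

```python
import os


def worst_case_threads(optuna_n_jobs, cv_n_jobs, model_threads):
    """Upper bound on concurrently runnable threads under nested parallelism."""
    return optuna_n_jobs * cv_n_jobs * model_threads


def is_oversubscribed(optuna_n_jobs, cv_n_jobs, model_threads, cores=None):
    """True if the worst-case thread count exceeds the available cores."""
    cores = cores or os.cpu_count() or 1
    return worst_case_threads(optuna_n_jobs, cv_n_jobs, model_threads) > cores


# Illustrative numbers only: 16 cores is a hypothetical driver size.
print(worst_case_threads(5, 5, 8))            # 5 * 5 * 8 = 200
print(is_oversubscribed(5, 5, 8, cores=16))   # True
print(is_oversubscribed(1, 1, 8, cores=16))   # False
```

If something like this is happening, it could explain why n_jobs=1 finishes faster on a small dataset, since the scheduling overhead would outweigh the parallel gain, but I have not confirmed that this is the cause of the driver restarts.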
Thanks!