2 weeks ago
Recently, while using Databricks for deep learning, I ran into an issue: after running for a certain amount of time, the cluster would break and restart. The logs are as below:
Specifically, my program prints a lot of output, and since everything printed is recorded in the driver logs, I suspect the cluster breaks because of an OOM caused by the driver logs. So, I would like to know:
Looking forward to getting a reply from the experts, thank you very much!
2 weeks ago
The error message "echo: write error: no space left on device" indicates that the storage space for the driver logs might be full.
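To confirm whether the driver's local disk is actually filling up, a quick check from a notebook cell can help. This is only a minimal sketch using the Python standard library; the path "/" is an assumption, and the driver logs may live on a different mount in your environment.

```python
import shutil

def disk_report(path="/"):
    """Return (total, used, free) in GiB for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return tuple(round(v / gib, 2) for v in (usage.total, usage.used, usage.free))

# "/" is an assumed path; point it at the volume you suspect is full.
total, used, free = disk_report("/")
print(f"total={total} GiB used={used} GiB free={free} GiB")
```

Running this periodically during training would show whether free space shrinks as the job prints more output.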
The default storage location for driver logs in Databricks is on the local disk of the driver node. However, the exact size limit can vary depending on the specific configuration of your Databricks environment and the type of cloud storage you're using.
The issue "Driver is up but is not responsive, likely due to GC" could indeed be due to memory limitations. Garbage Collection (GC) pauses can make the driver unresponsive if the system is trying to free up memory space. The link you provided does give an explanation related to job output limits, which might be related if your program is generating a large amount of output that is being logged.
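If the volume of printed output is the cause, one option is to reduce it at the source. The sketch below is an assumption about how a training loop might be structured, not your actual code: it replaces per-step prints with Python's standard `logging` module and only emits a message every N steps. The names `make_logger` and `should_log` and the placeholder loss are hypothetical.

```python
import logging
import sys

def make_logger(name="train", level=logging.INFO):
    """Create a stdout logger once; repeated calls reuse the same handler."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

def should_log(step, every=100):
    """Return True only on every `every`-th step, starting at step 0."""
    return step % every == 0

logger = make_logger()
for step in range(1000):
    loss = 1.0 / (step + 1)  # placeholder value standing in for a real training loss
    if should_log(step, every=100):
        logger.info("step=%d loss=%.4f", step, loss)
```

With `every=100`, a 1000-step loop writes 10 lines instead of 1000. Raising the logger level to `logging.WARNING` would silence these messages entirely.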
Modifying the create cluster → Advanced Options → Logging → Destination setting to change the storage location for logs could potentially help solve this problem. You could consider directing the logs to a location with more available storage space.
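If you create clusters through the Clusters API rather than the UI, the same setting corresponds to the `cluster_log_conf` field of the cluster specification. The fragment below is only a sketch of that part of the request body; the destination `dbfs:/cluster-logs` is a hypothetical path, and the rest of the cluster spec is omitted.

```python
import json

# Only the log-delivery part of a cluster spec; other required fields are omitted.
# "dbfs:/cluster-logs" is a hypothetical destination, replace it with your own.
cluster_spec_fragment = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs"}
    }
}

payload = json.dumps(cluster_spec_fragment, indent=2)
print(payload)
```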
a week ago
Thanks for your reply. However, although I changed create cluster → Advanced Options → Logging → Destination to another destination, the "echo: write error: no space left on device" error still appears. I changed the destination to "/dbfs/FileStore", where there is plenty of space. Can you help me? (Very distressed.)
Saturday
Hello Jaron, is it not possible for you to redirect the logging to an ABFSS or S3 bucket?
Saturday
Hi, Walter_C, I have tried redirecting the .log file to another destination. However, I found that redirection through create cluster → Advanced Options → Logging → Destination is a copy rather than a move. This means that the driver log on the driver node will still keep growing. (The Spark UI of Databricks was useless here and did not display any valid information, including the memory usage of drivers and executors.)
Finally, I reluctantly switched to a larger driver to solve this problem.