Hello,
My problem:
I'm trying to run PyTorch code that uses multiprocessing on Databricks, and my code is crashing with the following message:
Fatal error: The Python kernel is unresponsive.
The Python process exited with exit code 134 (SIGABRT: Aborted).
Closing down clientserver connection
Assertion failed: ok (src/mailbox.cpp:99)
Fatal Python error: Aborted
While debugging, I found that the code crashes when constructing a PyTorch DataLoader with num_workers > 0.
If num_workers=0, the code runs fine.
This is the exact point of the crash:
> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()
1040 # before it starts, and __del__ tries to join but will get:
1041 # AssertionError: can only join a started process.
-> 1042 w.start()
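To make the failure mode concrete, here is a minimal sketch of the pattern that triggers it. The toy dataset and batch size are assumptions for illustration; the real code presumably uses its own Dataset. Worker processes are started inside the DataLoader machinery exactly at the `w.start()` line shown in the traceback.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Tiny in-memory dataset standing in for the real one (assumption)."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([idx], dtype=torch.float32)

def iterate_loader(num_workers):
    """Build a DataLoader and consume it fully, returning the number
    of samples seen. Worker processes are spawned when num_workers > 0."""
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=num_workers)
    return sum(len(batch) for batch in loader)

if __name__ == "__main__":
    # num_workers=0 loads data in the main process and works on the cluster.
    print(iterate_loader(num_workers=0))
    # num_workers=2 forks worker processes and aborts with the
    # SIGABRT / mailbox.cpp assertion shown above on the affected cluster:
    # print(iterate_loader(num_workers=2))
```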
Other things I tried:
I tried changing the pin_memory parameter - it crashes whether it is True or False.
Additionally, another part of the code uses multiprocessing via Pool:
from multiprocessing.pool import Pool
It also crashes with the message that the Python kernel is unresponsive.
Rewriting that part to run in a single process also seems to resolve the issue.
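For reference, this is roughly the pattern in that part of the code, with the single-process fallback that avoids the crash. The worker function and inputs are placeholders, not the real code:

```python
from multiprocessing.pool import Pool

def square(x):
    """Placeholder for the real per-item work."""
    return x * x

def run(parallel):
    items = [1, 2, 3, 4]
    if parallel:
        # Pool forks worker processes; on the affected Databricks cluster
        # this aborts with "Assertion failed: ok (src/mailbox.cpp:99)".
        with Pool(processes=2) as pool:
            return pool.map(square, items)
    # Single-process fallback: same result, no worker processes.
    return [square(x) for x in items]

if __name__ == "__main__":
    print(run(parallel=False))  # [1, 4, 9, 16]
```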
I checked whether this is a memory issue by monitoring the process and didn't see any problem there. I also tested the code on a larger cluster and it still crashed.
Additional information:
Cluster details: m5.2xlarge, 32 GB memory, 8 cores
Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)
Python version: 3.9
PyTorch version: 2.0.1+cu117
(I tried several clusters with more memory and it happened on all of them.)
Any help with this will be appreciated 🙏