Issue with running multiprocessing on databricks: Python kernel is unresponsive error

956020
New Contributor II

Hello,

My problem:

I'm trying to run PyTorch code that includes multiprocessing on Databricks, and my code is crashing with the following error:

Fatal error: The Python kernel is unresponsive.

The Python process exited with exit code 134 (SIGABRT: Aborted).

Closing down clientserver connection

Assertion failed: ok (src/mailbox.cpp:99)

Fatal Python error: Aborted

While debugging, I found that the code crashes when creating a PyTorch DataLoader with num_workers > 0.

If num_workers=0, the code runs fine. 

This is the exact point of the crash: 

> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()

1040 # before it starts, and __del__ tries to join but will get:

1041 # AssertionError: can only join a started process.

-> 1042 w.start()
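
Roughly, this is the pattern that triggers it (the dataset, sizes, and batch size here are just placeholders, not my real code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to illustrate the DataLoader call.
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# Crashes as soon as iteration starts, when the worker processes
# are spawned (the w.start() in the traceback above):
bad_loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=False)

# Works: all loading happens in the main (driver) process.
good_loader = DataLoader(dataset, batch_size=32, num_workers=0)

for batch in good_loader:
    pass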

Other things I tried:

I tried changing the pin_memory parameter; it crashes whether it is True or False.

Additionally, in another part of the code I have a multiprocessing step that uses Pool:

from multiprocessing.pool import Pool

It also crashes with the message that the Python kernel is unresponsive.

Rewriting this step to run in a single process also seems to resolve the issue.
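
For context, this is a simplified sketch of that step (the worker function and data are placeholders); the sequential fallback is what avoids the crash:

from multiprocessing.pool import Pool

def process_item(item):
    # Placeholder for the actual per-item work.
    return item * 2

items = list(range(100))

# Crashes on Databricks with "The Python kernel is unresponsive":
with Pool(processes=4) as pool:
    results = pool.map(process_item, items)

# Works: the same work done sequentially in the driver process.
results = [process_item(i) for i in items]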

I checked whether this is a memory issue by monitoring the process and didn't see a problem there. I also tested the code on a larger cluster and it still crashes.

Additional information: 

These are my cluster details: m5.2xlarge, 32 GB memory, 8 cores

Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)

Python version: 3.9

PyTorch version: 2.0.1+cu117

(I tried different clusters with more memory and it happened with all of them)

Any help with this will be appreciated 🙏

ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

This is because Python multiprocessing does not use the distributed framework of Spark/Databricks.

When you use it, your code runs on the driver only and the workers do nothing.

More info here.

So you should use a Spark-enabled ML library, like sparktorch, or skip Spark and use Ray instead, for example:

https://docs.databricks.com/machine-learning/ray-integration.html
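
As a rough sketch of what the Ray route looks like (this assumes the ray package is installed on the cluster; ray.init() here just runs Ray locally on the driver, see the linked docs for starting a Ray cluster on top of Spark), parallel work is expressed with remote tasks instead of multiprocessing:

import ray

# Start (or attach to) Ray; on its own this runs Ray on the driver node only.
ray.init(ignore_reinit_error=True)

@ray.remote
def process_item(item):
    # Placeholder for the per-item work previously done in a Pool worker.
    return item * 2

# Launch tasks in parallel and collect the results.
futures = [process_item.remote(i) for i in range(100)]
results = ray.get(futures)

ray.shutdown()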

