
Issue with running multiprocessing on Databricks: Python kernel is unresponsive error

956020
New Contributor II

Hello,

My problem:

I'm trying to run PyTorch code that includes multiprocessing on Databricks, and my code is crashing with the following error:

Fatal error: The Python kernel is unresponsive.

The Python process exited with exit code 134 (SIGABRT: Aborted).

Closing down clientserver connection

Assertion failed: ok (src/mailbox.cpp:99)

Fatal Python error: Aborted

While debugging, I found that the code crashes when constructing a PyTorch DataLoader with num_workers > 0.

If num_workers=0, the code runs fine. 

This is the exact point of the crash: 

> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()

1040 # before it starts, and __del__ tries to join but will get:

1041 # AssertionError: can only join a started process.

-> 1042 w.start()
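
For reference, here is a minimal sketch of the kind of DataLoader setup that reproduces the crash (the dataset and sizes here are stand-ins, not my actual code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; my real dataset is larger and more complex.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Crashes on Databricks when num_workers > 0 (the worker processes are
# started inside w.start() in dataloader.py); works with num_workers=0.
loader = DataLoader(dataset, batch_size=32, num_workers=2)

for batch in loader:
    pass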

Other things I tried:

I tried changing the pin_memory parameter; it crashes whether it is True or False.

Additionally, in another part of the code I have a multiprocessing step that uses Pool:

from multiprocessing.pool import Pool

It also crashes with the message that the Python kernel is unresponsive.

Rewriting this step to run in a single process also seems to resolve the issue; a sketch of that change follows below.
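
For illustration, a minimal sketch of the two variants (work and items are placeholders for my real function and data):

from multiprocessing.pool import Pool

def work(item):
    return item * item  # placeholder for the real per-item computation

items = list(range(8))

# This version crashes the Python kernel on the cluster:
# with Pool(processes=4) as pool:
#     results = pool.map(work, items)

# Single-process version that runs fine:
results = [work(item) for item in items]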

I checked whether this is a memory issue by monitoring the process and saw no problem there. I also tested the code on a larger cluster, and it still crashes.

Additional information: 

These are my cluster details: m5.2xlarge, 32 GB memory, 8 cores

Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)

Python version: 3.9

PyTorch version: 2.0.1+cu117

(I tried different clusters with more memory and it happened with all of them)

Any help with this will be appreciated 🙏

ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

This is because Python multiprocessing does not use the distributed framework of Spark/Databricks.

When you use it, your code runs on the driver only, and the workers do nothing.

More info here.

So you should use a Spark-enabled ML library, like sparktorch (a rough sketch follows below).
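
For illustration, a rough sketch along the lines of the sparktorch README (the column names, network shape, and parameters here are assumptions; check the project docs before relying on them):

from sparktorch import serialize_torch_obj, SparkTorch
import torch
import torch.nn as nn

# Assumed: df is a Spark DataFrame with a 'features' vector column
# and a 'label' column; the network shape is illustrative.
network = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

torch_obj = serialize_torch_obj(
    model=network,
    criterion=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam,
    lr=0.001
)

spark_model = SparkTorch(
    inputCol='features',
    labelCol='label',
    predictionCol='predictions',
    torchObj=torch_obj,
    iters=50
)

fitted = spark_model.fit(df)  # training is distributed across the Spark workers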

Or skip Spark entirely and use Ray instead, for example:

https://docs.databricks.com/machine-learning/ray-integration.html
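
As a rough sketch of that route (based on the linked Databricks docs; it requires a Ray version with the Spark integration, ray.util.spark, and exact parameter names may differ by Ray version):

from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
import ray

# Start a Ray cluster on the Spark workers, then connect to it.
setup_ray_cluster(num_worker_nodes=2)
ray.init()

@ray.remote
def square(x):
    return x * x

# Work now runs in Ray worker processes instead of notebook-local
# multiprocessing.
print(ray.get([square.remote(i) for i in range(4)]))

shutdown_ray_cluster()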


