
Issue with running multiprocessing on databricks: Python kernel is unresponsive error

956020
New Contributor II

Hello,

My problem:

I'm trying to run PyTorch code that includes multiprocessing on Databricks, and my code is crashing with the following error:

Fatal error: The Python kernel is unresponsive.

The Python process exited with exit code 134 (SIGABRT: Aborted).

Closing down clientserver connection

Assertion failed: ok (src/mailbox.cpp:99)

Fatal Python error: Aborted

While trying to debug it, it seems the code crashes when creating a PyTorch DataLoader with num_workers > 0.

If num_workers=0, the code runs fine.

This is the exact point of the crash: 

> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()

1040 # before it starts, and __del__ tries to join but will get:

1041 # AssertionError: can only join a started process.

-> 1042 w.start()
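For context, the `w.start()` at the crash point is the DataLoader launching a worker process via Python's `multiprocessing`. A minimal stdlib sketch of the same mechanism, using the "spawn" start method instead of the default "fork" (an assumption worth testing: spawn sometimes avoids fork-related aborts in notebook environments, but this has not been verified on this Databricks runtime):

```python
import multiprocessing as mp

def square(x):
    # Trivial picklable worker function; under "spawn" the function must be
    # importable at module top level, just like DataLoader worker targets.
    return x * x

if __name__ == "__main__":
    # "spawn" starts a fresh interpreter per worker rather than forking the
    # (possibly JVM/py4j-entangled) driver process.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        results = pool.map(square, range(4))
    print(results)  # [0, 1, 4, 9]
```

PyTorch's DataLoader accepts a `multiprocessing_context` argument, so the same idea can be tried there (e.g. `multiprocessing_context="spawn"`); whether it resolves the SIGABRT on this cluster is an open question.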

Other things I tried : 

I tried changing the pin_memory parameter; it crashes whether it is True or False.

Additionally, in another part of the code I have a multiprocessing step using Pool:

from multiprocessing.pool import Pool

It also crashes with the same message that the Python kernel is unresponsive.

Rewriting this step to run in a single thread also seems to resolve the issue.
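One middle ground between `Pool` and fully single-threaded code is the stdlib `ThreadPool`, which keeps all work inside one process (so no forked children can abort) while still parallelizing I/O-bound work. A hedged sketch; the function name `work` is illustrative, and this only helps if the workload is I/O-bound or releases the GIL:

```python
from multiprocessing.pool import ThreadPool

def work(x):
    # Placeholder for the real per-item task.
    return x + 1

if __name__ == "__main__":
    # Threads share the driver process, avoiding the fork/SIGABRT path entirely.
    with ThreadPool(4) as pool:
        out = pool.map(work, [1, 2, 3])
    print(out)  # [2, 3, 4]
```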

I checked whether this is a memory issue by monitoring the process and didn't see a problem there. I also tested the code on a larger cluster, and it still crashes.

Additional information: 

These are my cluster details: m5.2xlarge, 32 GB memory, 8 cores

Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)

Python version: 3.9

PyTorch version: 2.0.1+cu117

(I tried different clusters with more memory and it happened with all of them)

Any help with this will be appreciated 🙏

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

This is because multiprocessing does not use the distributed framework of Spark/Databricks.

When you use it, your code runs on the driver only, and the workers do nothing.

More info here.

So you should use a Spark-enabled ML library, like sparktorch.

Or skip Spark and use Ray instead, for example:

https://docs.databricks.com/machine-learning/ray-integration.html

View solution in original post

