Hello,
My problem:
I'm trying to run PyTorch code that uses multiprocessing on Databricks, and my code is crashing with the following message:
Fatal error: The Python kernel is unresponsive.
The Python process exited with exit code 134 (SIGABRT: Aborted).
Closing down clientserver connection
Assertion failed: ok (src/mailbox.cpp:99)
Fatal Python error: Aborted
While debugging, I found that the code crashes when constructing a PyTorch DataLoader with num_workers > 0.
If num_workers=0, the code runs fine.
This is the exact point of the crash:
> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/torch/utils/data/dataloader.py(1042)__init__()
1040 # before it starts, and __del__ tries to join but will get:
1041 # AssertionError: can only join a started process.
-> 1042 w.start()
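To make the failure mode concrete, here is a minimal sketch of the pattern that triggers it. The toy dataset and batch size are assumptions for illustration; the real code presumably uses its own Dataset. Worker processes are started inside the DataLoader machinery exactly at the `w.start()` line shown in the traceback.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Tiny in-memory dataset standing in for the real one (assumption)."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return torch.tensor([idx], dtype=torch.float32)

def iterate_loader(num_workers):
    """Build a DataLoader and consume it fully, returning the number
    of samples seen. Worker processes are spawned when num_workers > 0."""
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=num_workers)
    return sum(len(batch) for batch in loader)

if __name__ == "__main__":
    # num_workers=0 loads data in the main process and works on the cluster.
    print(iterate_loader(num_workers=0))
    # num_workers=2 forks worker processes and aborts with the
    # SIGABRT / mailbox.cpp assertion shown above on the affected cluster:
    # print(iterate_loader(num_workers=2))
```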
Other things I tried:
I tried changing the pin_memory parameter - it crashes whether it is True or False.
Additionally, another part of the code uses multiprocessing via Pool:
from multiprocessing.pool import Pool
It also crashes with the message that the Python kernel is unresponsive.
Rewriting that part to run in a single process also seems to resolve the issue.
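For reference, this is roughly the pattern in that part of the code, with the single-process fallback that avoids the crash. The worker function and inputs are placeholders, not the real code:

```python
from multiprocessing.pool import Pool

def square(x):
    """Placeholder for the real per-item work."""
    return x * x

def run(parallel):
    items = [1, 2, 3, 4]
    if parallel:
        # Pool forks worker processes; on the affected Databricks cluster
        # this aborts with "Assertion failed: ok (src/mailbox.cpp:99)".
        with Pool(processes=2) as pool:
            return pool.map(square, items)
    # Single-process fallback: same result, no worker processes.
    return [square(x) for x in items]

if __name__ == "__main__":
    print(run(parallel=False))  # [1, 4, 9, 16]
```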
I checked whether this is a memory issue by monitoring the process and didn't see any problem there. I also tested the code on a larger cluster and it still crashed.
Additional information:
Cluster details: m5.2xlarge, 32 GB memory, 8 cores
Databricks runtime version: 11.2 (includes Apache Spark 3.3.0, Scala 2.12)
Python version: 3.9
PyTorch version: 2.0.1+cu117
(I tried several clusters with more memory and it happened on all of them.)
Any help with this will be appreciated 🙏