
Python notebook crashes with "The Python kernel is unresponsive"

TalY
New Contributor II

I am using a Python notebook that works on my local machine, but on Databricks it crashes at the same point every time with the errors "The Python kernel is unresponsive" and "The Python process exited with exit code 134 (SIGABRT: Aborted).". There is no stack trace for debugging the issue in the notebook output or in the Databricks cluster's logs (and no memory spikes in the monitoring). What can I do to debug this?

7 REPLIES

Kaniz_Fatma
Community Manager

Hi @TalY,

• Troubleshooting steps for debugging a Python notebook crash:
 - Check for recent code changes or updates that may have caused the issue.
 - Look for pandas or collect operations that could be causing memory problems (see the sketch after the sources below).
 - Monitor the memory usage of the driver node in an interactive cluster.
 - Review the code and check whether the dataset size exceeds the available driver memory in a job cluster.
 - Consider the possibility of an ADF pipeline triggering the notebook, and check for spot instance issues.
 - Take a heap dump of the driver and analyze it for memory leaks or excessive memory usage.
• Review the notebook code, cluster configuration, and recent changes to identify the root cause.
• Reach out to Databricks support if the issue persists.
• Sources:
 https://docs.databricks.com/languages/pandas-spark.html
 https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
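For the bullet about pandas and collect operations, here is a minimal, hypothetical sketch of the pattern to look for. It is not from the original notebook: `df` stands in for the notebook's own Spark DataFrame and "some_column" is a placeholder.

# Risky: materializes the entire DataFrame in driver memory at once.
# pdf = df.toPandas()
# rows = df.collect()

# Safer: aggregate or filter on the executors first, then bring back only the small result.
summary = df.groupBy("some_column").count()   # stays distributed
small_pdf = summary.toPandas()                # only the aggregated rows reach the driver

# Or cap how much is transferred while prototyping.
sample_pdf = df.limit(10_000).toPandas()

# Or keep pandas-style code distributed via the pandas API on Spark (Spark 3.2+).
psdf = df.pandas_api()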

TalY
New Contributor II

I have been using the Ganglia UI, but I didn't see memory running out. Is that the correct way to monitor memory usage? Are there more options?

Kaniz_Fatma
Community Manager
Hi @TalY,

• To monitor the memory usage of the driver node in an interactive cluster, check the CPU, memory, disk I/O, and network I/O utilization metrics.
• Use Ganglia to check cluster health, and download the Spark logs for troubleshooting.
• Enable a heap dump to capture a snapshot of the memory of the Java process (a rough sketch of the idea follows below).
• Use the provided code to enable the heap dump.
• Once the code is run, a .sh file named databricks_debug_script_collect_driver_stats.sh is created under the provided path.
• Point the cluster's init script parameter at this script and restart the cluster.
• A driver heap dump will then be generated in the specified path for monitoring and diagnosing the memory usage of the driver node in the interactive cluster.
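The "provided code" referenced above is not included in this thread. As a rough, hypothetical sketch of the same idea, the driver JVM can be asked to write a heap dump on OutOfMemoryError by adding -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/tmp/heap_dumps to spark.driver.extraJavaOptions in the cluster's Spark config (the DBFS path is an assumption), after which a notebook cell can confirm the flags and look for dumps:

# Hypothetical sketch: confirm heap-dump JVM flags and list any dumps produced.
# Assumes a Databricks notebook (so `spark` exists) and that the cluster's Spark config sets
#   spark.driver.extraJavaOptions = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/tmp/heap_dumps
import glob
import os

print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions", "<not set>"))

dump_dir = "/dbfs/tmp/heap_dumps"   # must match the HeapDumpPath above (assumed path)
os.makedirs(dump_dir, exist_ok=True)
print(glob.glob(os.path.join(dump_dir, "*.hprof")))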

sean_owen
Honored Contributor II

This is almost surely an OOM. Yes, you use the Metrics tab in the cluster UI to see memory usage. However, you may not observe high memory usage before the OOM; maybe something is allocating a huge amount of memory at once.

I think 90% of these issues are resolvable by code inspection. What step fails? Is it pulling a bunch of stuff to the driver? Are you allocating a huge dataset?
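One rough way to answer that last question is to measure the pandas footprint of a small sample and extrapolate to the full row count. A minimal sketch, where `df` stands in for whatever DataFrame the failing step converts or collects and the 1,000-row sample size is arbitrary:

# Hypothetical sketch: estimate how big the data would be once pulled to the driver as pandas.
n_rows = df.count()
sample_pdf = df.limit(1000).toPandas()
bytes_per_row = sample_pdf.memory_usage(deep=True).sum() / max(len(sample_pdf), 1)
est_gb = bytes_per_row * n_rows / 1e9
print(f"~{est_gb:.1f} GB estimated if the full DataFrame were pulled to the driver as pandas")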

TalY
New Contributor II

I did notice log messages about memory allocation failures in the driver's logs a couple of times, so I tried two things: using a smaller DataFrame (from 200k rows down to 10k) and optimizing the pandas usage, but neither helped. After some searching over the weekend, I found that adding the following lines prevents the crash:

logging.getLogger("py4j").setLevel(logging.ERROR)
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)

Also, I have been running that notebook successfully on my personal computer, which has 32 GB of RAM, while the Databricks driver is an "m5d.4xlarge", which has 64 GB.

I would of course prefer a cleaner solution, so given all this, is OOM still the most probable direction?
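Since the process dies with SIGABRT rather than a Python MemoryError, the allocation failure may be happening in native code, so one more data point worth collecting is the driver process's memory right before and after the failing step. A minimal sketch, assuming psutil is available on the cluster (it ships with recent Databricks runtimes and can be pip-installed otherwise):

import psutil

def log_driver_memory(label):
    # Resident set size of this Python process plus overall machine memory.
    proc_gb = psutil.Process().memory_info().rss / 1e9
    vm = psutil.virtual_memory()
    print(f"[{label}] python process: {proc_gb:.1f} GB, machine: {vm.used / 1e9:.1f}/{vm.total / 1e9:.1f} GB used")

log_driver_memory("before failing step")
# ... run the step that crashes ...
log_driver_memory("after failing step")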

shan_chandra
Esteemed Contributor

@TalY - Could you please let us know which DBR version you are running? Kindly try DBR 12.2 LTS or above.

To debug this, there should be an hs_err_pid.log file with details about the problematic JVM, referenced under the "Python kernel unresponsive" error stack trace.
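If that file is not surfaced in the error output, a rough sketch of how to look for it from a notebook on the same cluster follows. The search paths are assumptions; the JVM writes hs_err_pid*.log to its working directory unless -XX:ErrorFile points elsewhere.

# Hypothetical sketch: search common locations on the driver for JVM fatal-error logs.
import glob

candidates = []
for pattern in ("/databricks/driver/hs_err_pid*.log", "/tmp/hs_err_pid*.log"):
    candidates.extend(glob.glob(pattern))
print(candidates)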

TalY
New Contributor II

I am using DBR 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12).

Fatal error: The Python kernel is unresponsive.
---------------------------------------------------------------------------
The Python process exited with exit code 134 (SIGABRT: Aborted).
---------------------------------------------------------------------------
The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
Tue Aug 1 18:02:57 2023 Connection to spark from PID 2632
Tue Aug 1 18:02:57 2023 Initialized gateway on port 45165
Tue Aug 1 18:02:57 2023 Connected to spark.
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[2023-08-01 18:06:22,007] [INFO] Received command c on object id p0
[2023-08-01 18:06:22,030] [INFO] Received command c on object id p0

And:
Last messages on stdout:
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work. To exit, you will have to explicitly quit this process, by either sending "quit" from a client, or using Ctrl-\ in UNIX-like environments. To read more about this, see https://github.com/ipython/ipython/issues/2049

Those log lines led me in the direction of changing the log level.
