
Python notebook crashes with "The Python kernel is unresponsive"

TalY
New Contributor II

I am using a Python notebook that works on my local machine, but on Databricks it crashes at the same point every time with the errors "The Python kernel is unresponsive" and "The Python process exited with exit code 134 (SIGABRT: Aborted).". There is no stack trace for debugging the issue in the notebook output or in the Databricks cluster's logs (and no memory spikes in the monitoring). What can I do to debug this?

7 REPLIES

Kaniz_Fatma
Community Manager

Hi @TalY,

• Troubleshooting steps for debugging a Python notebook crash:
 - Check for recent code changes or updates that may have caused the issue.
 - Look for pandas or collect operations that could be causing memory problems (see the sketch after the sources below).
 - Monitor the memory usage of the driver node in an interactive cluster.
 - Review the code and check whether the dataset size exceeds the available driver memory in a job cluster.
 - Consider the possibility of an ADF pipeline triggering the notebook, and check for spot instance issues.
 - Take a heap dump of the driver and analyze it for memory leaks or excessive memory usage.
• Review the notebook code, cluster configuration, and recent changes to identify the root cause.
• Reach out to Databricks support if the issue persists.
• Sources:
 https://docs.databricks.com/languages/pandas-spark.html
 https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
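For the bullet about pandas and collect operations, here is a minimal, hypothetical sketch of the pattern to look for. It is not from the original notebook: `df` stands in for the notebook's own Spark DataFrame and "some_column" is a placeholder.

# Risky: materializes the entire DataFrame in driver memory at once.
# pdf = df.toPandas()
# rows = df.collect()

# Safer: aggregate or filter on the executors first, then bring back only the small result.
summary = df.groupBy("some_column").count()   # stays distributed
small_pdf = summary.toPandas()                # only the aggregated rows reach the driver

# Or cap how much is transferred while prototyping.
sample_pdf = df.limit(10_000).toPandas()

# Or keep pandas-style code distributed via the pandas API on Spark (Spark 3.2+).
psdf = df.pandas_api()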

TalY
New Contributor II

I have been using the Ganglia UI, but I didn't see memory running out. Is that the correct way to monitor memory usage? Are there more options?

Kaniz_Fatma
Community Manager
Hi @TalY,

• To monitor the memory usage of the driver node in an interactive cluster, check the CPU, memory, disk I/O, and network I/O utilization metrics.
• Use Ganglia to check cluster health, and download the Spark logs for troubleshooting.
• Enable a heap dump to capture a snapshot of the memory of the Java process (a rough sketch of the idea follows below).
• Use the provided code to enable the heap dump.
• Once the code is run, a .sh file named databricks_debug_script_collect_driver_stats.sh is created under the provided path.
• Point the cluster's init script parameter at this script and restart the cluster.
• A driver heap dump will then be generated in the specified path for monitoring and diagnosing the memory usage of the driver node in the interactive cluster.
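The "provided code" referenced above is not included in this thread. As a rough, hypothetical sketch of the same idea, the driver JVM can be asked to write a heap dump on OutOfMemoryError by adding -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/tmp/heap_dumps to spark.driver.extraJavaOptions in the cluster's Spark config (the DBFS path is an assumption), after which a notebook cell can confirm the flags and look for dumps:

# Hypothetical sketch: confirm heap-dump JVM flags and list any dumps produced.
# Assumes a Databricks notebook (so `spark` exists) and that the cluster's Spark config sets
#   spark.driver.extraJavaOptions = -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dbfs/tmp/heap_dumps
import glob
import os

print(spark.sparkContext.getConf().get("spark.driver.extraJavaOptions", "<not set>"))

dump_dir = "/dbfs/tmp/heap_dumps"   # must match the HeapDumpPath above (assumed path)
os.makedirs(dump_dir, exist_ok=True)
print(glob.glob(os.path.join(dump_dir, "*.hprof")))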

sean_owen
Honored Contributor II

This is almost surely an OOM. Yes, you use the Metrics tab in the cluster UI to see memory usage. However, you may not observe high memory usage before the OOM; maybe something is allocating a huge amount of memory at once.

I think 90% of these issues are resolvable by code inspection. What step fails? Is it pulling a bunch of stuff to the driver? Are you allocating a huge dataset?
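One rough way to answer that last question is to measure the pandas footprint of a small sample and extrapolate to the full row count. A minimal sketch, where `df` stands in for whatever DataFrame the failing step converts or collects and the 1,000-row sample size is arbitrary:

# Hypothetical sketch: estimate how big the data would be once pulled to the driver as pandas.
n_rows = df.count()
sample_pdf = df.limit(1000).toPandas()
bytes_per_row = sample_pdf.memory_usage(deep=True).sum() / max(len(sample_pdf), 1)
est_gb = bytes_per_row * n_rows / 1e9
print(f"~{est_gb:.1f} GB estimated if the full DataFrame were pulled to the driver as pandas")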

TalY
New Contributor II

I did notice log messages about memory allocation failures in the driver's logs a couple of times, so I tried two things: using a smaller DataFrame (from 200k rows down to 10k) and optimizing the pandas usage, but neither helped. After some searching over the weekend, I found that adding the following lines prevents the crash:

logging.getLogger("py4j").setLevel(logging.ERROR)
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)

Also, I have been running that notebook successfully on my personal computer, which has 32 GB of RAM, while the Databricks driver is an "m5d.4xlarge", which has 64 GB.

I would of course prefer a cleaner solution, so given all this, is OOM still the most probable direction?
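Since the process dies with SIGABRT rather than a Python MemoryError, the allocation failure may be happening in native code, so one more data point worth collecting is the driver process's memory right before and after the failing step. A minimal sketch, assuming psutil is available on the cluster (it ships with recent Databricks runtimes and can be pip-installed otherwise):

import psutil

def log_driver_memory(label):
    # Resident set size of this Python process plus overall machine memory.
    proc_gb = psutil.Process().memory_info().rss / 1e9
    vm = psutil.virtual_memory()
    print(f"[{label}] python process: {proc_gb:.1f} GB, machine: {vm.used / 1e9:.1f}/{vm.total / 1e9:.1f} GB used")

log_driver_memory("before failing step")
# ... run the step that crashes ...
log_driver_memory("after failing step")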

shan_chandra
Esteemed Contributor

@TalY - Could you please let us know which DBR version you are running? Kindly try DBR 12.2 LTS or above.

To debug this, there should be an hs_err_pid.log file with details about the problematic JVM, referenced under the "Python kernel unresponsive" error stack trace.
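If that file is not surfaced in the error output, a rough sketch of how to look for it from a notebook on the same cluster follows. The search paths are assumptions; the JVM writes hs_err_pid*.log to its working directory unless -XX:ErrorFile points elsewhere.

# Hypothetical sketch: search common locations on the driver for JVM fatal-error logs.
import glob

candidates = []
for pattern in ("/databricks/driver/hs_err_pid*.log", "/tmp/hs_err_pid*.log"):
    candidates.extend(glob.glob(pattern))
print(candidates)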

TalY
New Contributor II

I am using DBR 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12).

Fatal error: The Python kernel is unresponsive.
---------------------------------------------------------------------------
The Python process exited with exit code 134 (SIGABRT: Aborted).
---------------------------------------------------------------------------
The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
Tue Aug 1 18:02:57 2023 Connection to spark from PID 2632
Tue Aug 1 18:02:57 2023 Initialized gateway on port 45165
Tue Aug 1 18:02:57 2023 Connected to spark.
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[IPKernelApp] WARNING | No such comm: LSP_COMM_ID
[2023-08-01 18:06:22,007] [INFO] Received command c on object id p0
[2023-08-01 18:06:22,030] [INFO] Received command c on object id p0

And:
Last messages on stdout:
NOTE: When using the `ipython kernel` entry point, Ctrl-C will not work. To exit, you will have to explicitly quit this process, by either sending "quit" from a client, or using Ctrl-\ in UNIX-like environments. To read more about this, see https://github.com/ipython/ipython/issues/2049

Those log lines led me in the direction of changing the log level.
