<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Fatal error: Python kernel is unresponsive in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32497#M23691</link>
    <description>&lt;P&gt;Hey @Cheuk Hin Christophe Poon​&amp;nbsp;, I'm not sure whether you managed to solve this issue.&lt;/P&gt;&lt;P&gt;I saw in a Databricks blog post that this error is caused by running out of RAM; &lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" alt="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="_blank"&gt;link here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Besides, when I tried to run my notebook from a job, not only did the run&amp;nbsp;finish without any errors, but the RAM in use was also cut in half. Maybe give that a try if you haven't managed to solve it yet.&lt;/P&gt;&lt;P&gt;I think that when you run code inside a notebook, a lot of state is kept around and fills up the RAM (it's just a feeling; I haven't confirmed it).&lt;/P&gt;</description>
    <pubDate>Tue, 25 Oct 2022 17:27:52 GMT</pubDate>
    <dc:creator>Orianh</dc:creator>
    <dc:date>2022-10-25T17:27:52Z</dc:date>
    <item>
      <title>Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32485#M23679</link>
      <description>&lt;P&gt;Hey guys,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm using petastorm to train a DNN. First I convert a Spark DataFrame with make_spark_converter and then open a reader on the materialized dataset.&lt;/P&gt;&lt;P&gt;When I start a training session on only a subset of the data, everything works fine, but when I use the full dataset, my notebook crashes after about 500 batches with "Python kernel is unresponsive". Do any of you know why this is happening?&lt;/P&gt;&lt;P&gt;I saw a somewhat similar question already and looked at the thread dumps, but didn't understand them very well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Besides, I get a lot of FutureWarnings from petastorm about pyarrow. Any idea how to avoid all these warnings?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2022 08:03:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32485#M23679</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-09-07T08:03:07Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32486#M23680</link>
      <description>&lt;P&gt;Same error. This started a few days ago on notebooks that used to run fine in the past. Now, I cannot finish a notebook.&lt;/P&gt;&lt;P&gt;I have already disabled almost all output being streamed to the result buffer, but the problem persists. I am left with &amp;lt;50 lines being logged/printed. If Databricks cannot handle such a minimal amount of output, it's not a usable solution.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 07:14:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32486#M23680</guid>
      <dc:creator>230134</dc:creator>
      <dc:date>2022-09-08T07:14:36Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32487#M23681</link>
      <description>&lt;P&gt;In my case, this turned out to be a memory issue. For whatever reason, Databricks doesn't properly raise a MemoryError. So you're kind of left hanging and have to figure it out yourself.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 14:13:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32487#M23681</guid>
      <dc:creator>230134</dc:creator>
      <dc:date>2022-09-08T14:13:12Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32488#M23682</link>
      <description>&lt;P&gt;Thanks for sharing your findings. How did you determine this was a `MemoryError`?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:19:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32488#M23682</guid>
      <dc:creator>susodapop</dc:creator>
      <dc:date>2022-09-09T15:19:58Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32489#M23683</link>
      <description>&lt;P&gt;I opened the terminal to the cluster and just monitored htop. I could see memory usage going up, hitting the limit, going into swap, and then dropping to a base level at the same time as the FatalError was raised.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:39:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32489#M23683</guid>
      <dc:creator>230134</dc:creator>
      <dc:date>2022-09-09T15:39:25Z</dc:date>
    </item>
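Watching htop from the cluster's web terminal, as described above, works well; a rough in-notebook alternative is to log the driver process's peak memory using only the Python standard library. A minimal sketch (not a Databricks API; the MB conversion assumes Linux, where ru_maxrss is reported in kilobytes):

```python
# Rough in-notebook memory check using only the Python standard library.
# ru_maxrss is this process's peak resident set size: kilobytes on Linux,
# bytes on macOS. It only sees the Python driver process, not the JVM.
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident memory in MB (approximate)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak = peak / 1024  # macOS reports bytes, convert to KB first
    return peak / 1024

print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```

Calling `peak_rss_mb()` before and after a suspect cell gives a crude picture of how much memory that cell retained.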
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32490#M23684</link>
      <description>&lt;P&gt;I also noticed the same behavior. How can we handle such a problem, in your opinion? It seems we'd need some way to manage the RAM...&lt;/P&gt;</description>
      <pubDate>Tue, 13 Sep 2022 13:20:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32490#M23684</guid>
      <dc:creator>ilvacca</dc:creator>
      <dc:date>2022-09-13T13:20:45Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32491#M23685</link>
      <description>&lt;P&gt;Hey guys,&lt;/P&gt;&lt;P&gt;While I was training I noticed two things that might cause the error.&lt;/P&gt;&lt;P&gt;The first is that after a training session crashed, the GPU memory was almost full (checked with the nvidia-smi command).&lt;/P&gt;&lt;P&gt;The second is that I saw in the Ganglia metrics swap usage above the total memory of the cluster.&lt;/P&gt;&lt;P&gt;In my use case I use make_reader from petastorm to read the petastorm dataset, and its default workers_count is 10. When I changed workers_count to 4, I didn't get any errors.&lt;/P&gt;&lt;P&gt;I haven't figured out whether I'm truly right or what the right way to overcome this is.&lt;/P&gt;&lt;P&gt;Would like to hear your opinion.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Sep 2022 09:46:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32491#M23685</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-09-14T09:46:57Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32492#M23686</link>
      <description>&lt;P&gt;In my case I use a simple notebook with some OpenCV processing. The code is not yet optimized to run on a cluster (I use a single-node cluster for testing, coupled with Synapse), but it seems absurd to me that the kernel crashes because the RAM fills up (I verified this via the cluster monitoring panel).&lt;/P&gt;&lt;P&gt;Do you think it is possible to define a "max RAM usage" per notebook somewhere?&lt;/P&gt;</description>
      <pubDate>Wed, 14 Sep 2022 09:52:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32492#M23686</guid>
      <dc:creator>ilvacca</dc:creator>
      <dc:date>2022-09-14T09:52:02Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32493#M23687</link>
      <description>&lt;P&gt;Hi @orian hindi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 24 Sep 2022 06:04:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32493#M23687</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-09-24T06:04:51Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32494#M23688</link>
      <description>&lt;P&gt;I also have the same problem.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Before the &lt;B&gt;&lt;I&gt;&lt;U&gt;Fatal error: Python kernel is unresponsive&lt;/U&gt;&lt;/I&gt;&lt;/B&gt;, the step &lt;B&gt;&lt;I&gt;&lt;U&gt;Determining location of DBIO file fragments. This operation can take some time&lt;/U&gt;&lt;/I&gt;&lt;/B&gt; took 6.92 hours.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I want to know whether this is normal.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Fatal Error"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1530i91293FB06B9C317A/image-size/large?v=v2&amp;amp;px=999" role="button" title="Fatal Error" alt="Fatal Error" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;@&lt;A href="https://community.databricks.com/s/profile/0058Y00000B2rdUQAR" alt="https://community.databricks.com/s/profile/0058Y00000B2rdUQAR" target="_blank"&gt;Vidula Khanna&lt;/A&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Oct 2022 08:58:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32494#M23688</guid>
      <dc:creator>Dicer</dc:creator>
      <dc:date>2022-10-05T08:58:09Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32496#M23690</link>
      <description>&lt;P&gt;Hey @Alessio Vaccaro​&amp;nbsp;, sorry for the really delayed response &lt;span class="lia-unicode-emoji" title=":grinning_face_with_sweat:"&gt;😅&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I didn't find any documentation or any good resource on this.&lt;/P&gt;&lt;P&gt;I would hope that if only one notebook is attached to a cluster, that notebook can use all the RAM allocated to the Spark driver, and that when more notebooks are attached, some mechanism kicks in to share it.&lt;/P&gt;&lt;P&gt;Actually, I saw a Databricks blog post saying that "Fatal error: The Python kernel is unresponsive." is caused by running out of RAM.&lt;/P&gt;&lt;P&gt;You can see the blog here:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" alt="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="_blank"&gt;Accelerating Your Deep Learning with PyTorch Lightning on Databricks - The Databricks Blog&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2022 17:17:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32496#M23690</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-10-25T17:17:28Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32497#M23691</link>
      <description>&lt;P&gt;Hey @Cheuk Hin Christophe Poon​&amp;nbsp;, I'm not sure whether you managed to solve this issue.&lt;/P&gt;&lt;P&gt;I saw in a Databricks blog post that this error is caused by running out of RAM; &lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" alt="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="_blank"&gt;link here&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Besides, when I tried to run my notebook from a job, not only did the run&amp;nbsp;finish without any errors, but the RAM in use was also cut in half. Maybe give that a try if you haven't managed to solve it yet.&lt;/P&gt;&lt;P&gt;I think that when you run code inside a notebook, a lot of state is kept around and fills up the RAM (it's just a feeling; I haven't confirmed it).&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2022 17:27:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32497#M23691</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-10-25T17:27:52Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32498#M23692</link>
      <description>&lt;P&gt;Hey @Vidula Khanna​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I found a workaround: I created a job that runs the notebook (with a cluster spec rather than an existing cluster, which is also cheaper).&lt;/P&gt;&lt;P&gt;I think that when a notebook is attached to an existing cluster, a lot of its state is saved, which fills the RAM, or some mechanism starts allocating memory for it and any other notebook that might come.&lt;/P&gt;&lt;P&gt;When I ran the notebook from a job, the memory in use was cut in half and the run finished without any errors.&lt;/P&gt;&lt;P&gt;But for sure, this error is caused by running out of RAM: &lt;A href="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" alt="https://www.databricks.com/blog/2022/09/07/accelerating-your-deep-learning-pytorch-lightning-databricks.html" target="_blank"&gt;link here&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2022 17:39:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32498#M23692</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-10-25T17:39:07Z</dc:date>
    </item>
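The job-based workaround above can be sketched as a create-job payload for the Databricks Jobs API 2.1, running the notebook on a fresh job cluster instead of an attached all-purpose cluster. The notebook path, runtime version, and node type below are placeholder assumptions, not values from the thread:

```python
# Sketch of a Jobs API 2.1 create-job payload (POST /api/2.1/jobs/create).
# All concrete values (paths, versions, node types) are placeholders.
import json

payload = {
    "name": "train-dnn-job",
    "tasks": [
        {
            "task_key": "train",
            "notebook_task": {"notebook_path": "/Users/me/train_dnn"},
            # A new_cluster spec means the job gets its own cluster, so
            # no accumulated notebook state competes for driver RAM.
            "new_cluster": {
                "spark_version": "11.3.x-ml-scala2.12",
                "node_type_id": "Standard_NC6s_v3",
                "num_workers": 1,
            },
        }
    ],
}

print(json.dumps(payload, indent=2))
```

The same payload can be submitted via the REST API or adapted for the Databricks CLI; job clusters are also billed at the cheaper jobs-compute rate, which matches the cost observation in the post.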
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32499#M23693</link>
      <description>&lt;P&gt;@orian hindi​&amp;nbsp;I also think the problem is insufficient RAM. But I already deployed 6-8&lt;A href="https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series" alt="https://learn.microsoft.com/en-us/azure/virtual-machines/ncv3-series" target="_blank"&gt; Standard_NC6s_v3&lt;/A&gt; instances (GPU-accelerated compute) in Azure Databricks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is that still not enough to run K-means clustering on 252,000 data points (n_cluster = 11, max iteration = 10) using Spark ML and scikit-learn?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2022 19:33:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32499#M23693</guid>
      <dc:creator>Dicer</dc:creator>
      <dc:date>2022-10-25T19:33:18Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32500#M23694</link>
      <description>&lt;P&gt;@Vidula Khanna​&amp;nbsp;&lt;/P&gt;&lt;P&gt;@orian hindi​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Today, I tried to transpose a big dataset (rows: 252x17, columns: 1000). 999 columns are structured numerical float data and 1 column is a DateTime data type.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I deployed &lt;B&gt;Standard_E4ds_v4&lt;/B&gt; in Azure Databricks. That should be enough for transposing this data.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is the &lt;B&gt;code&lt;/B&gt;:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df_sp500_elements.pandas_api().set_index('stock_dateTime').T.reset_index().rename(columns={"index": "stock_dateTime"}).to_spark().show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, after running for 14.45 hours, there is still a &lt;B&gt;Fatal error: The Python kernel is unresponsive&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is the &lt;B&gt;Ganglia cluster report&lt;/B&gt; during transposition:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Ganglia__ cluster Report"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1539i3758E12CC773139B/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ganglia__ cluster Report" alt="Ganglia__ cluster Report" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is the &lt;B&gt;event log&lt;/B&gt;:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="GC"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1531iA6A484DF844E3D91/image-size/large?v=v2&amp;amp;px=999" role="button" title="GC" alt="GC" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think the &lt;B&gt;Fatal error: The Python kernel is unresponsive&lt;/B&gt; is not caused by insufficient RAM.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is my full &lt;B&gt;Fatal error: The Python kernel is unresponsive.&lt;/B&gt; error message:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;The Python process exited with an unknown exit code.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.&lt;/P&gt;&lt;P&gt;---------------------------------------------------------------------------&lt;/P&gt;&lt;P&gt;Last messages on stderr:&lt;/P&gt;&lt;P&gt;Wed Nov 9 12:46:54 2022 Connection to spark from PID 933&lt;/P&gt;&lt;P&gt;Wed Nov 9 12:46:54 2022 Initialized gateway on port 34615&lt;/P&gt;&lt;P&gt;Wed Nov 9 12:46:55 2022 Connected to spark.&lt;/P&gt;&lt;P&gt;/databricks/spark/python/pyspark/sql/dataframe.py:3605: FutureWarning: DataFrame.to_pandas_on_spark is deprecated. Use DataFrame.pandas_api instead.&lt;/P&gt;&lt;P&gt; warnings.warn(&lt;/P&gt;&lt;P&gt;ERROR:root:KeyboardInterrupt while sending command.&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt; File "/databricks/spark/python/pyspark/sql/pandas/conversion.py", line 364, in _collect_as_arrow&lt;/P&gt;&lt;P&gt; results = list(batch_stream)&lt;/P&gt;&lt;P&gt; File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 56, in load_stream&lt;/P&gt;&lt;P&gt; for batch in self.serializer.load_stream(stream):&lt;/P&gt;&lt;P&gt; File "/databricks/spark/python/pyspark/sql/pandas/serializers.py", line 112, in load_stream&lt;/P&gt;&lt;P&gt; reader = pa.ipc.open_stream(stream)&lt;/P&gt;&lt;P&gt; File "/databricks/python/lib/python3.9/site-packages/pyarrow/ipc.py", line 154, in open_stream&lt;/P&gt;&lt;P&gt; return RecordBatchStreamReader(source)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 03:53:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32500#M23694</guid>
      <dc:creator>Dicer</dc:creator>
      <dc:date>2022-11-10T03:53:50Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32501#M23695</link>
      <description>&lt;P&gt;If a Python process does not use Spark, such as plain pandas (not pandas API on Spark), only one node is used. I ran into the exact same error on a regular cluster with multiple nodes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One solution is to use a &lt;B&gt;single node with a lot of memory&lt;/B&gt;, such as 128 GB or above. That means allocating enough resources to a single node instead of splitting them across multiple nodes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, I try to avoid pandas, as most problems can be solved using Spark, except for some special utilities where there is no other choice.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Nov 2022 01:59:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32501#M23695</guid>
      <dc:creator>lizou</dc:creator>
      <dc:date>2022-11-11T01:59:43Z</dc:date>
    </item>
    <item>
      <title>Re: Fatal error: Python kernel is unresponsive</title>
      <link>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32504#M23698</link>
      <description>&lt;P&gt;@lizou​&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Today, I had the same problem when using Spark to transpose a 1000-column x 4284-row structured data matrix. The data size is about 2 GB.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here is the code:&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/NikhilSuthar/TransposeDataFrame" alt="https://github.com/NikhilSuthar/TransposeDataFrame" target="_blank"&gt;https://github.com/NikhilSuthar/TransposeDataFrame&lt;/A&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col, concat_ws, collect_list


def TransposeDF(df, columns, pivotCol):
    # Build "'name',name" pairs for stack(), unpivot the columns into
    # rows, then pivot back around pivotCol.
    columnsValue = list(map(lambda x: str("'") + str(x) + str("',") + str(x), columns))
    stackCols = ','.join(x for x in columnsValue)
    df_1 = df.selectExpr(pivotCol, "stack(" + str(len(columns)) + "," + stackCols + ")")\
             .select(pivotCol, "col0", "col1")
    final_df = df_1.groupBy(col("col0")).pivot(pivotCol).agg(concat_ws("", collect_list(col("col1"))))\
                   .withColumnRenamed("col0", pivotCol)
    return final_df


df = TransposeDF(df, df.columns[1:], "AAPL_dateTime")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;(The above code works for transposing a small data matrix, e.g. 5 columns x 252 rows.)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I deployed one 32 GB memory VM and there is still a &lt;B&gt;Fatal error: Python kernel is unresponsive&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Transposing a data matrix should only have O(C x R) space and runtime complexity.&lt;/P&gt;&lt;P&gt;In my case, that should be about 2 GB of space.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I checked the Databricks live metrics. Only 20% of the CPU is used and there is still 20 GB of free memory. However, there is a &lt;B&gt;Driver is up but not responsive, likely due to GC&lt;/B&gt; message in the event log.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have no idea why there is still a &lt;B&gt;Fatal error: Python kernel is unresponsive&lt;/B&gt; &lt;span class="lia-unicode-emoji" title=":face_with_tears_of_joy:"&gt;😂&lt;/span&gt;. Perhaps it is not only related to memory? &lt;span class="lia-unicode-emoji" title=":dizzy_face:"&gt;😵&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now, I am trying one 112 GB memory GPU VM to transpose the 2 GB data matrix, and there is no &lt;B&gt;Driver is up but not responsive, likely due to GC&lt;/B&gt; in the event log. Hope this works. But I still cannot understand why transposing a 2 GB data matrix needs that amount of memory &lt;span class="lia-unicode-emoji" title=":grinning_face_with_sweat:"&gt;😅&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 12 Nov 2022 06:20:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/fatal-error-python-kernel-is-unresponsive/m-p/32504#M23698</guid>
      <dc:creator>Dicer</dc:creator>
      <dc:date>2022-11-12T06:20:07Z</dc:date>
    </item>
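For what it's worth, the O(C x R) claim in the last post is easy to see in plain Python: a transpose touches each cell exactly once. The cost in the Spark versions comes on top of that, from the shuffles that groupBy/pivot (and collecting to pandas) introduce, not from the transpose itself. A tiny single-machine illustration:

```python
# Pure-Python transpose: time and extra space are both proportional to
# the number of cells (columns x rows), with no shuffle or copy overhead.
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(m))  # [[1, 4], [2, 5], [3, 6]]
```

On a distributed DataFrame the same operation has to move every cell between executors, which is why memory and GC pressure can far exceed the raw data size.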
  </channel>
</rss>

