cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Fatal error: The Python kernel is unresponsive when attempting to query data from AWS Redshift within Jupyter notebook

kll
New Contributor III

I am running jupyter notebook on a cluster with configuration:

12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

Worker type: i3.xlarge 30.5gb memory, 4 cores

Min 2 and max 8 workers

cursor = conn.cursor()
 
cursor.execute(
               """
               SELECT * FROM (
                  select *, random() as sample 
                  from tbl
               WHERE type = 'type1'
               AND created_at >= '2022-01-01 00:00:00' 
               AND created_at <= '2022-06-30 00:00:00') as samp
               WHERE sample < .05; -- return 5% of rows
               """
              )
 
# To Pandas DataFrame
df = DataFrame(cursor.fetchall())
 
# # # Get column names
field_names = [i[0] for i in cursor.description]
df.columns = field_names

Fatal error: The Python kernel is unresponsive.
 
---------------------------------------------------------------------------
 
The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.
 
 
 
The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
 
---------------------------------------------------------------------------
 
Last messages on stderr:
 
ks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1112, in structured_traceback
 
    return FormattedTB.structured_traceback(
 
  File "/databricks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1006, in structured_traceback
 
  File "/databricks/python/lib/python3.9/site-packages/stack_data/core.py", line 649, in included_pieces
 
    pos = scope_pieces.index(self.executing_piece)
 
  File "/databricks/python/lib/python3.9/site-packages/stack_data/utils.py", line 145, in cached_property_wrapper
 
    value = obj.__dict__[self.func.__name__] = self.func(obj)
 
  File "/databricks/python/lib/python3.9/site-packages/executing/executing.py", line 164, in only
 
    raise NotOneValueFound('Expected one value, found 0')
 
executing.executing.NotOneValueFound: Expected one value, found 0
 
Traceback (most recent call last):
 
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/redshift_connector/core.py", line 1631, in execute
 
    ps = cache["ps"][key]
 
KeyError: ('SELECT * \nFROM \n             SELECT * \n             FROM tbl\n             WHERE event_type = %s\n             AND created_at >= %s\n             AND created_at <= %s \n              \n LIMIT 5', ((<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>)))

 The data is fairly large, probably 300-400m rows. what configuration do I need to modify? or optimize the query? parallel processing etc.?

1 REPLY 1

Debayan
Databricks Employee
Databricks Employee

Hi, Could you please confirm the usage of your cluster while running this job? you can monitor the performance here: https://docs.databricks.com/clusters/clusters-manage.html#monitor-performance with different metrics.

Also, please tag @Debayan​ with your next response which will notify me. Thank you!

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group