I am running a Jupyter notebook on a Databricks cluster with the following configuration:
Databricks Runtime 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)
Worker type: i3.xlarge (30.5 GB memory, 4 cores)
Autoscaling: min 2, max 8 workers
from pandas import DataFrame

cursor = conn.cursor()
cursor.execute(
    """
    SELECT * FROM (
        SELECT *, RANDOM() AS sample
        FROM tbl
        WHERE type = 'type1'
          AND created_at >= '2022-01-01 00:00:00'
          AND created_at <= '2022-06-30 00:00:00') AS samp
    WHERE sample < .05;  -- return 5% of rows
    """
)
# Load the full result set into a pandas DataFrame
df = DataFrame(cursor.fetchall())
# Get column names from the cursor metadata
field_names = [i[0] for i in cursor.description]
df.columns = field_names
Fatal error: The Python kernel is unresponsive.
---------------------------------------------------------------------------
The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.
The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
ks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1112, in structured_traceback
return FormattedTB.structured_traceback(
File "/databricks/python/lib/python3.9/site-packages/IPython/core/ultratb.py", line 1006, in structured_traceback
File "/databricks/python/lib/python3.9/site-packages/stack_data/core.py", line 649, in included_pieces
pos = scope_pieces.index(self.executing_piece)
File "/databricks/python/lib/python3.9/site-packages/stack_data/utils.py", line 145, in cached_property_wrapper
value = obj.__dict__[self.func.__name__] = self.func(obj)
File "/databricks/python/lib/python3.9/site-packages/executing/executing.py", line 164, in only
raise NotOneValueFound('Expected one value, found 0')
executing.executing.NotOneValueFound: Expected one value, found 0
Traceback (most recent call last):
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/redshift_connector/core.py", line 1631, in execute
ps = cache["ps"][key]
KeyError: ('SELECT * \nFROM \n SELECT * \n FROM tbl\n WHERE event_type = %s\n AND created_at >= %s\n AND created_at <= %s \n \n LIMIT 5', ((<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>), (<RedshiftOID.UNKNOWN: 705>, 0, <function text_out at 0x7f7a348d3790>)))
The data is fairly large, probably 300-400 million rows. What configuration do I need to modify? Or should I optimize the query, use parallel processing, etc.?
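For reference, one workaround I was considering is streaming the result in chunks instead of calling fetchall(), so the driver never has to hold the entire raw result list at once. This is only a sketch: it assumes the standard DB-API cursor.fetchmany() on the redshift_connector cursor, reuses my existing conn, and CHUNK_SIZE is just a knob I made up to tune against driver memory.

import pandas as pd

CHUNK_SIZE = 100_000  # rows pulled per round trip; hypothetical value, tune to driver memory

cursor = conn.cursor()
cursor.execute(
    """
    SELECT * FROM (
        SELECT *, RANDOM() AS sample
        FROM tbl
        WHERE type = 'type1'
          AND created_at >= '2022-01-01 00:00:00'
          AND created_at <= '2022-06-30 00:00:00') AS samp
    WHERE sample < .05;
    """
)
field_names = [col[0] for col in cursor.description]

chunks = []
while True:
    rows = cursor.fetchmany(CHUNK_SIZE)  # standard DB-API call; returns an empty sequence when exhausted
    if not rows:
        break
    # Each chunk could instead be written to disk or aggregated here rather than kept in memory
    chunks.append(pd.DataFrame(rows, columns=field_names))

df = pd.concat(chunks, ignore_index=True)

I realize this only helps if the 5% sample itself fits in driver memory, since the final concat still materializes everything; otherwise each chunk would have to be written out or reduced inside the loop. I'm not sure whether this is the right direction, or whether I should be reading the table through Spark instead of through the driver.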