11-11-2022 10:20 PM
@lizou
Today I hit the same problem when using Spark to transpose a structured data matrix of 1000 columns x 4284 rows. The data size is about 2 GB.
Here is the code:
https://github.com/NikhilSuthar/TransposeDataFrame
from pyspark.sql.functions import col, collect_list, concat_ws
from pyspark.sql import SparkSession

def TransposeDF(df, columns, pivotCol):
    # Build "'name',name" pairs so stack() emits (column name, value) rows.
    columnsValue = [f"'{x}',{x}" for x in columns]
    stackCols = ','.join(columnsValue)
    # Unpivot: one row per (pivotCol value, column name, cell value).
    df_1 = df.selectExpr(pivotCol, "stack(" + str(len(columns)) + "," + stackCols + ")")\
        .select(pivotCol, "col0", "col1")
    # Pivot back so that each distinct pivotCol value becomes an output column.
    final_df = df_1.groupBy(col("col0")).pivot(pivotCol).agg(concat_ws("", collect_list(col("col1"))))\
        .withColumnRenamed("col0", pivotCol)
    return final_df

df = TransposeDF(df, df.columns[1:], "AAPL_dateTime")

(The above code works for transposing a small data matrix, e.g. 5 columns x 252 rows.)
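To make the two steps easier to reason about, here is a minimal plain-Python model of what the function does: unpivot every cell into a (key, column name, value) triple, then pivot those triples back with the key values as columns. The helper names `stack_rows` and `pivot_rows` and the toy table are made up for illustration; this is not how Spark executes the query, and unlike the original it keeps the raw values instead of concatenating them into strings.

```python
def stack_rows(table, key, value_cols):
    """Model of stack(): one (key value, column name, cell value) triple per cell."""
    return [(row[key], name, row[name]) for row in table for name in value_cols]


def pivot_rows(long_rows, key):
    """Model of groupBy(col0).pivot(key): one output row per original column."""
    out = {}
    for key_value, name, value in long_rows:
        # The first cell seen for a column creates its output row.
        out.setdefault(name, {key: name})[key_value] = value
    return list(out.values())


# Hypothetical 3 rows x 2 value columns table.
table = [
    {"date": "d1", "open": 1, "close": 2},
    {"date": "d2", "open": 3, "close": 4},
    {"date": "d3", "open": 5, "close": 6},
]

long_rows = stack_rows(table, "date", ["open", "close"])
transposed = pivot_rows(long_rows, "date")

print(len(long_rows))  # one triple per cell: 3 rows x 2 columns
print(transposed)
```

The point of the model is that the intermediate long table has one row per cell, and the result has one column per distinct key value.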
I deployed a VM with 32 GB of memory and still got `Fatal error: Python kernel is unresponsive`.
Transposing a data matrix should need only O(C x R) space and time, where C is the number of columns and R the number of rows.
In my case, that should be about 2 GB of space.
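As a rough sanity check on those numbers, the shapes involved in a stack-then-pivot transposition can be worked out directly. This is only back-of-the-envelope arithmetic under two assumptions: all 1000 columns are treated as value columns, and every `AAPL_dateTime` value is distinct. It does not establish what actually makes the driver unresponsive.

```python
rows, cols = 4284, 1000       # matrix shape from the post

# stack() emits one (key, column name, value) row per cell.
long_form_rows = rows * cols

# pivot() creates one output column per distinct key value
# (assumption: every AAPL_dateTime value is distinct).
pivot_columns = rows

print(long_form_rows)         # 4284000
print(pivot_columns)          # 4284
```

So although the cell count stays at C x R, this approach goes through a long intermediate table of about 4.3 million rows and ends with a result that is several thousand columns wide.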
I checked the Databricks live metrics: only 20% of the CPU is in use and there is still 20 GB of free memory. However, the event log shows `Driver is up but not responsive, likely due to GC`.
I have no idea why I still get `Fatal error: Python kernel is unresponsive` 😂. Perhaps it is not only related to memory? 😵
Now I am trying a GPU instance with 112 GB of memory to transpose the 2 GB data matrix, and `Driver is up but not responsive, likely due to GC` no longer appears in the event log. I hope this works, but I still cannot understand why transposing a 2 GB data matrix needs that much memory 😅