Dicer
Valued Contributor

@lizou​ 

Today, I ran into the same problem when using Spark to transpose a structured data matrix of 1000 columns x 4284 rows. The data size is about 2 GB.

Here is the code:

https://github.com/NikhilSuthar/TransposeDataFrame

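from pyspark.sql.functions import col, collect_list, concat_ws
from pyspark.sql import SparkSession
 
def TransposeDF(df, columns, pivotCol):
    # Build "'name',name" pairs: each column name as a string literal, followed by the column itself
    columnsValue = list(map(lambda x: str("'") + str(x) + str("',")  + str(x), columns))
    stackCols = ','.join(x for x in columnsValue)
    # stack() unpivots every input row into len(columns) rows of (col0 = column name, col1 = value)
    df_1 = df.selectExpr(pivotCol, "stack(" + str(len(columns)) + "," + stackCols + ")")\
             .select(pivotCol, "col0", "col1")
    # Pivot on pivotCol so that each of its distinct values becomes an output column
    final_df = df_1.groupBy(col("col0")).pivot(pivotCol).agg(concat_ws("", collect_list(col("col1"))))\
                   .withColumnRenamed("col0", pivotCol)
    return final_df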
 
 
df = TransposeDF(df, df.columns[1:], "AAPL_dateTime")

(The above code works for transposing a small data matrix, e.g. 5 columns x 252 rows.)
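To see how the work grows with the matrix size, here is a rough sketch in plain Python (no Spark needed) that rebuilds the same `stack(...)` expression string that `TransposeDF` passes to `selectExpr`. The helper name `build_stack_expr` and the column names `c0`, `c1`, ... are hypothetical, used only for illustration.

```python
def build_stack_expr(columns):
    """Rebuild the expression string that TransposeDF passes to selectExpr."""
    pairs = ["'" + str(c) + "'," + str(c) for c in columns]
    return "stack(" + str(len(columns)) + "," + ",".join(pairs) + ")"

small = build_stack_expr(["a", "b", "c"])
print(small)  # stack(3,'a',a,'b',b,'c',c)

# Hypothetical column names standing in for the 1000 real ones.
wide = build_stack_expr(["c" + str(i) for i in range(1000)])
print(len(wide))  # length of the single SQL expression Spark has to parse
```

Since `stack` emits one row per listed column for every input row, 1000 columns x 4284 rows become 4,284,000 rows before the pivot. The pivot then creates one output column per distinct `AAPL_dateTime` value, so up to 4284 columns if every value is distinct.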

I deployed one VM with 32 GB of memory, and I still get `Fatal error: Python kernel is unresponsive`.

Transposing a data matrix should only have O(C x R) space and runtime complexity, where C is the number of columns and R is the number of rows.

In my case, that should be about 2 GB of space.
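To make that expectation concrete, here is a minimal single-machine sketch in plain Python (not Spark, and not the code I am actually running). It only illustrates that a transpose touches each cell once, which is where the O(C x R) estimate comes from; the function name `transpose` is just for illustration.

```python
def transpose(rows):
    """Transpose a list of equal-length rows; each cell is read and written once."""
    return [list(column) for column in zip(*rows)]

matrix = [[1, 2, 3],
          [4, 5, 6]]  # R = 2 rows, C = 3 columns
transposed = transpose(matrix)
print(transposed)  # [[1, 4], [2, 5], [3, 6]]

# Cell count for the matrix in this post: 1000 columns x 4284 rows.
cells = 1000 * 4284
print(cells)  # 4284000
```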

I checked the Databricks live metrics: only 20% of the CPU is used and there are still 20 GB of free memory. However, the event log shows `Driver is up but not responsive, likely due to GC`.

I have no idea why I still get `Fatal error: Python kernel is unresponsive` 😂. Perhaps it is not only related to memory? 😵

Now I am trying a GPU machine with 112 GB of memory to transpose the 2 GB data matrix, and `Driver is up but not responsive, likely due to GC` no longer appears in the event log. I hope this works, but I still cannot understand why transposing a 2 GB data matrix needs that much memory 😅