Data Engineering

Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

Mado
Valued Contributor II

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results.

Sample dataset:

# Prepare data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]

# Create DataFrame
columns = ["Country", "2018", "2019", "2020"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

+-------+----+----+----+
|Country|2018|2019|2020|
+-------+----+----+----+
|Spain  |101 |201 |301 |
|Taiwan |102 |202 |302 |
|Italy  |103 |203 |303 |
|China  |104 |204 |304 |
+-------+----+----+----+

Below is the code I have tried:

from pyspark.sql import functions as F
 
unpivotExpr = "stack(3, '2018', 2018, '2019', 2019, '2020', 2020) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()

And the results:

+-------+----+----+
|Country|Year| CPI|
+-------+----+----+
|  Spain|2018|2018|
|  Spain|2019|2019|
|  Spain|2020|2020|
| Taiwan|2018|2018|
| Taiwan|2019|2019|
| Taiwan|2020|2020|
|  Italy|2018|2018|
|  Italy|2019|2019|
|  Italy|2020|2020|
|  China|2018|2018|
|  China|2019|2019|
|  China|2020|2020|
+-------+----+----+

As you can see above, the value of column "CPI" is the same as column "Year", which is not expected.

Any idea how to solve this issue?

ACCEPTED SOLUTION

UmaMahesh1
Honored Contributor III

Hi @Mohammad Saber

The issue is that the column names look like numeric literals: in the stack expression, Spark parses 2018 as the integer constant 2018 rather than a reference to the column "2018", so that constant is repeated for every row.

To avoid this, give the columns non-numeric names, as in the sketch below. Alternatively, if you have a way to explicitly tell Spark that 2018 is a column reference and not a literal, you can try that too.
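A minimal sketch of the renaming approach (Y2018, Y2019 and Y2020 are illustrative names, not from the original post):

from pyspark.sql import functions as F

# Rename the year columns so they are valid identifiers rather than numbers
df2 = (df.withColumnRenamed("2018", "Y2018")
         .withColumnRenamed("2019", "Y2019")
         .withColumnRenamed("2020", "Y2020"))

# Y2018 etc. now resolve as column references instead of integer literals
unpivotExpr = "stack(3, '2018', Y2018, '2019', Y2019, '2020', Y2020) as (Year, CPI)"
df2.select("Country", F.expr(unpivotExpr)).show()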

Hope this helps...

Cheers

Uma Mahesh D


REPLIES


arggg
New Contributor II

On another dummy example I can't reproduce this error. Is there an explanation as to why this happens?

srinu1246
New Contributor II

If I have column names as below, how can I get the data?

 

unpivottest = "stack(2,'Turnover (Sas  m)',Turnover (Sas  m),'abc %', abc %) as (kpi_name, kpi_value)"                        
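Backtick quoting (see the reply below) should handle these names too; a sketch that assumes the DataFrame df really has columns named "Turnover (Sas  m)" and "abc %":

from pyspark.sql import functions as F

# Backticks let Spark parse names containing spaces, parentheses and %
# as column references (df and the column names are assumed from the question)
unpivottest = ("stack(2, 'Turnover (Sas  m)', `Turnover (Sas  m)`, "
               "'abc %', `abc %`) as (kpi_name, kpi_value)")
df.select(F.expr(unpivottest)).show()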

lukeoz
New Contributor II

You can also use backticks around the column names that would otherwise be recognised as numbers.

from pyspark.sql import functions as F
 
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()
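For the sample DataFrame above, the output should now show the actual CPI values rather than the year repeated:

+-------+----+---+
|Country|Year|CPI|
+-------+----+---+
|  Spain|2018|101|
|  Spain|2019|201|
|  Spain|2020|301|
| Taiwan|2018|102|
| Taiwan|2019|202|
| Taiwan|2020|302|
|  Italy|2018|103|
|  Italy|2019|203|
|  Italy|2020|303|
|  China|2018|104|
|  China|2019|204|
|  China|2020|304|
+-------+----+---+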
