Data Engineering

Using "Select Expr" and "Stack" to Unpivot PySpark DataFrame doesn't produce expected results

Mado
Valued Contributor II

I am trying to unpivot a PySpark DataFrame, but I don't get the correct results.

Sample dataset:

# Prepare data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]

# Create DataFrame
columns = ["Country", "2018", "2019", "2020"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

+-------+----+----+----+
|Country|2018|2019|2020|
+-------+----+----+----+
|Spain  |101 |201 |301 |
|Taiwan |102 |202 |302 |
|Italy  |103 |203 |303 |
|China  |104 |204 |304 |
+-------+----+----+----+

Below is the code I have tried:

from pyspark.sql import functions as F
 
unpivotExpr = "stack(3, '2018', 2018, '2019', 2019, '2020', 2020) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()

And the results:

+-------+----+----+
|Country|Year| CPI|
+-------+----+----+
|  Spain|2018|2018|
|  Spain|2019|2019|
|  Spain|2020|2020|
| Taiwan|2018|2018|
| Taiwan|2019|2019|
| Taiwan|2020|2020|
|  Italy|2018|2018|
|  Italy|2019|2019|
|  Italy|2020|2020|
|  China|2018|2018|
|  China|2019|2019|
|  China|2020|2020|
+-------+----+----+

As you can see above, the value of column "CPI" is the same as column "Year", which is not expected.

Any idea how to solve this issue?

ACCEPTED SOLUTION

UmaMahesh1
Honored Contributor III

Hi @Mohammad Saber

The issue is that the column names look like numeric literals: in the stack expression, Spark parses 2018 as the integer constant 2018 rather than a reference to the column "2018", so that constant is repeated for every row.

To avoid this, give the columns non-numeric names, as in the sketch below. Alternatively, if you have a way to explicitly tell Spark that 2018 is a column reference and not a literal, you can try that too.
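A minimal sketch of the renaming approach (Y2018, Y2019 and Y2020 are illustrative names, not from the original post):

from pyspark.sql import functions as F

# Rename the year columns so they are valid identifiers rather than numbers
df2 = (df.withColumnRenamed("2018", "Y2018")
         .withColumnRenamed("2019", "Y2019")
         .withColumnRenamed("2020", "Y2020"))

# Y2018 etc. now resolve as column references instead of integer literals
unpivotExpr = "stack(3, '2018', Y2018, '2019', Y2019, '2020', Y2020) as (Year, CPI)"
df2.select("Country", F.expr(unpivotExpr)).show()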

Hope this helps...

Cheers

Uma Mahesh D


REPLIES


arggg
New Contributor II

On another dummy example I can't reproduce this error. Is there an explanation as to why this happens?

srinu1246
New Contributor II

If I have column names as below, how can I get the data?

 

unpivottest = "stack(2,'Turnover (Sas  m)',Turnover (Sas  m),'abc %', abc %) as (kpi_name, kpi_value)"                        
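Backtick quoting (see the reply below) should handle these names too; a sketch that assumes the DataFrame df really has columns named "Turnover (Sas  m)" and "abc %":

from pyspark.sql import functions as F

# Backticks let Spark parse names containing spaces, parentheses and %
# as column references (df and the column names are assumed from the question)
unpivottest = ("stack(2, 'Turnover (Sas  m)', `Turnover (Sas  m)`, "
               "'abc %', `abc %`) as (kpi_name, kpi_value)")
df.select(F.expr(unpivottest)).show()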

lukeoz
New Contributor II

You can also use backticks around the column names that would otherwise be recognised as numbers.

from pyspark.sql import functions as F
 
unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()
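For the sample DataFrame above, the output should now show the actual CPI values rather than the year repeated:

+-------+----+---+
|Country|Year|CPI|
+-------+----+---+
|  Spain|2018|101|
|  Spain|2019|201|
|  Spain|2020|301|
| Taiwan|2018|102|
| Taiwan|2019|202|
| Taiwan|2020|302|
|  Italy|2018|103|
|  Italy|2019|203|
|  Italy|2020|303|
|  China|2018|104|
|  China|2019|204|
|  China|2020|304|
+-------+----+---+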
