12-12-2022 09:07 PM
I am trying to unpivot a PySpark DataFrame, but I don't get the correct results.
Sample dataset:
# Prepare Data
data = [("Spain", 101, 201, 301),
        ("Taiwan", 102, 202, 302),
        ("Italy", 103, 203, 303),
        ("China", 104, 204, 304)]

# Create DataFrame
columns = ["Country", "2018", "2019", "2020"]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)
Below is the code I have tried:
from pyspark.sql import functions as F
unpivotExpr = "stack(3, '2018', 2018, '2019', 2019, '2020', 2020) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()
And the results:
In the output, the value of column "CPI" is the same as the value of column "Year", which is not expected.
Any idea how to solve this issue?
Accepted Solutions
12-12-2022 11:18 PM
Hi @Mohammad Saber
The issue is that the column names look like integer literals, so Spark parses 2018 in the stack() expression as the constant 2018 and returns that value for every row instead of reading the column.
To avoid this, give the columns names that cannot be parsed as numbers.
Alternatively, if you have a way to tell Spark explicitly that 2018 is a column reference and not a literal, you can try that too.
Hope this helps...
Cheers
08-12-2023 04:56 AM
On another dummy example I can't reproduce this error. Is there an explanation as to why this happens?
12-08-2023 12:32 AM
If I have column names as below, how can I get the data?
05-02-2024 05:40 PM
You can also use backticks around the column names that would otherwise be recognised as numbers.
from pyspark.sql import functions as F

unpivotExpr = "stack(3, '2018', `2018`, '2019', `2019`, '2020', `2020`) as (Year, CPI)"
unPivotDF = df.select("Country", F.expr(unpivotExpr))
unPivotDF.show()

