Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Pivot on multiple columns

memo
New Contributor II

I want to pass multiple columns as arguments when pivoting a DataFrame in PySpark, something like:

mydf.groupBy("id").pivot("day", "city").agg(F.sum("price").alias("price"), F.sum("units").alias("units")).show()
 
One way I found is to create a separate DataFrame for each pivot and join them, but that results in multiple scans of the data. Is there any other way to do this?
3 REPLIES

Kaniz_Fatma
Community Manager

Hi @memo, certainly! You can achieve a more compact solution for pivoting multiple columns in PySpark. Instead of creating a separate DataFrame for each pivot by hand, you can use a function to handle the pivoting process.

memo
New Contributor II

But how would I pass multiple values to the pivot function? It only takes one argument. I tried passing an array and a list, but both throw errors.

Kaniz_Fatma
Community Manager

Hi @memo, let's call this function pivot_udf. Here's how you can implement it:

from pyspark.sql import functions as F

def pivot_udf(df, *cols):
    # Start from the distinct ids; each loop iteration joins on one
    # pivoted value column.
    mydf = df.select('id').drop_duplicates()
    for c in cols:
        mydf = mydf.join(
            # Build a combined label such as 'price_1', then pivot on it.
            df.withColumn('combcol', F.concat(F.lit(f'{c}_'), df['day']))
            .groupby('id')
            .pivot('combcol')
            .agg(F.first(c)),
            'id'
        )
    return mydf

# Example usage:
d = [
    (100, 1, 23, 10),
    (100, 2, 45, 11),
    # ... other data ...
]
mydf = spark.createDataFrame(d, ['id', 'day', 'price', 'units'])

# Pivot on 'price' and 'units'
result_df = pivot_udf(mydf, 'price', 'units')
result_df.show()

The resulting DataFrame will have columns for each combination of day and column (e.g., price_1, price_2, units_1, etc.). 

This approach wraps the per-column pivots and joins in a single function, giving a more compact solution.

Feel free to adapt the pivot_udf function to your specific use case by adding more columns as needed. Happy pivoting! 🚀
