
Pivot on multiple columns

memo
New Contributor II

I want to pass multiple columns as arguments to pivot a DataFrame in PySpark, like this:

mydf.groupBy("id").pivot("day", "city").agg(F.sum("price").alias("price"), F.sum("units").alias("units")).show()

One way I found is to create a separate DataFrame for each pivot and join them, but that results in multiple scans of the data. Is there any other way to do this?
3 REPLIES

Kaniz_Fatma
Community Manager

Hi @memo, certainly! You can achieve a more compact solution for pivoting multiple columns in PySpark. Instead of creating separate DataFrames for each pivot, you can use a function to handle the pivoting process.

memo
New Contributor II

How can I pass multiple values to the pivot function? It only takes one argument. I tried sending an array and a list, but both throw errors.
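
For reference, GroupedData.pivot takes a single pivot column and, optionally, a list of expected values of that column, not several pivot columns, which is why passing an array or list of column names fails. A minimal illustration of the accepted form:

from pyspark.sql import functions as F

# The optional second argument lists values of the pivot column, not extra pivot columns
mydf.groupBy("id").pivot("day", [1, 2]).agg(F.sum("price")).show()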

Kaniz_Fatma
Community Manager

Hi @memo, let's call this function pivot_udf. Here's how you can implement it:

from pyspark.sql import functions as F

def pivot_udf(df, *cols):
    # Start from the distinct ids; each pivoted column group is joined back onto this
    mydf = df.select('id').drop_duplicates()
    for c in cols:
        # Label each value as '<col>_<day>' (e.g. 'price_1'), pivot on that label,
        # and join the pivoted columns back to the running DataFrame on 'id'
        mydf = mydf.join(
            df.withColumn('combcol', F.concat(F.lit(f'{c}_'), df['day']))
            .groupBy('id')
            .pivot('combcol')
            .agg(F.first(c)),
            'id'
        )
    return mydf

# Example usage:
d = [
    (100, 1, 23, 10),
    (100, 2, 45, 11),
    # ... other data ...
]
mydf = spark.createDataFrame(d, ['id', 'day', 'price', 'units'])

# Pivot on 'price' and 'units'
result_df = pivot_udf(mydf, 'price', 'units')
result_df.show()

The resulting DataFrame will have columns for each combination of day and column (e.g., price_1, price_2, units_1, etc.). 

This keeps the per-column pivots and joins inside a single reusable function, giving a more compact solution than building and joining each pivoted DataFrame by hand.
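
For completeness, a single pivot with multiple aggregate expressions can also produce one column per (day, measure) pair in a single pass; Spark names the outputs <day value>_<alias> (e.g. 1_price, 1_units). A minimal sketch, assuming the same mydf as above:

# One pivot on 'day', one aggregate per measure; scans the data once and yields
# columns named like '1_price', '1_units', '2_price', '2_units'
alt_df = (
    mydf.groupBy('id')
    .pivot('day')
    .agg(F.sum('price').alias('price'), F.sum('units').alias('units'))
)
alt_df.show()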

Feel free to adapt the pivot_udf function to your specific use case by adding more columns as needed. Happy pivoting! 🚀
