cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to union multiple dataframe in pyspark within Databricks notebook

RaymondXie
New Contributor

I have 4 DFs: Avg_OpenBy_Year, AvgHighBy_Year, AvgLowBy_Year and AvgClose_By_Year, all of them have a common column of 'Year'.

I want to join the three together to get a final df like:

`Year, Open, High, Low, Close`

At the moment I have to use the ugly way to join them on column 'Year':
finalDF = Avg_Open_By_Year
.join(Avg_High_By_Year, on=['Year'], how='left_outer')
.join(Avg_Low_By_Year, on=['Year'], how='left_outer')
.join(Avg_Close_By_Year, on=['Year'], how='left_outer')

I think there should be a grace way to accomplish this, like UnionAll in SQL.

There is a possible solution here https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspa..., the selected answer is described below:from functools import reduce # For Python 3.x from pyspark.sql import DataFrame

def unionAll(*dfs): return reduce(DataFrame.unionAll, dfs)

unionAll(td2, td3, td4, td5, td6, td7, td8, td9, td10)

However, I am doing this in Databricks notebook, it throws me error:

 NameError: name 'functools' is not defined

0693f000007OoI6AAK

It would really be appreciated if someone can shed me with more light. Thank you very much.

1 REPLY 1

thiago_matos
New Contributor II

Import reduce function in this way:

from functools import reduce

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.