cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to do a unionAll() when the number and the name of columns are different?

User15787040559
New Contributor III

Looking at the API for Dataframe.unionAll() when you have 2 different dataframes with different number of columns and names unionAll() doesn't work.

How can you do it?

One possible solution is using the following function which performs the union of two dataframes with different schemas and returns a combined dataframe:

import pyspark.sql.functions as F
    
def union_different_schemas(df1, df2):
   # Get a list of all column names in both dfs
   columns_df1 = df1.columns
   columns_df2 = df2.columns
   # Get a list of datatypes of the columns
   data_types_df1 = [i.dataType for i in df1.schema.fields]
   data_types_df2 = [i.dataType for i in df2.schema.fields]
   # We go through all columns in df1 and if they are not in df2, we add 
   # them (and specify the correct datatype too)
   for col, typ in zip(columns_df1, data_types_df1):
      if col not in df2.columns:
         df2 = df2\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df2 has all missing columns from df1, let's do the same for df1
   for col, typ in zip(columns_df2, data_types_df2):
      if col not in df1.columns:
         df1 = df1\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df1 and df2 have the same columns, not necessarily in the same 
   # order, therefore we use unionByName
   combined_df = df1\
      .unionByName(df2)
 
   return combined_df

2 REPLIES 2

sean_owen
Honored Contributor II
Honored Contributor II

I'm not sure union is the right tool, if the DataFrames have fundamentally different information in them. If the difference is merely column name, yes, rename. If they don't, then the 'union' contemplated here is really a union of columns as well as rows. Yeah you can always produce a DataFrame with all rows from both, with all cols from both, with nulls where one DataFrame doesn't have the other's column(s). That could make sense but I'd also wonder whether a join() is what's intended in some cases like this.

The objective here is to append the rows of Dataframe 2 to the rows of Dataframe 1.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.