topic How to do a unionAll() when the number and the name of columns are different? in Data Engineering

How to do a unionAll() when the number and the name of columns are different?

User15787040559 — Fri, 18 Jun 2021 16:42:06 GMT

Looking at the API for Dataframe.unionAll() when you have 2 different dataframes with different number of columns and names unionAll() doesn't work.

How can you do it?

One possible solution is using the following function which performs the union of two dataframes with different schemas and returns a combined dataframe:

import pyspark.sql.functions as F
    
def union_different_schemas(df1, df2):
   # Get a list of all column names in both dfs
   columns_df1 = df1.columns
   columns_df2 = df2.columns
   # Get a list of datatypes of the columns
   data_types_df1 = [i.dataType for i in df1.schema.fields]
   data_types_df2 = [i.dataType for i in df2.schema.fields]
   # We go through all columns in df1 and if they are not in df2, we add 
   # them (and specify the correct datatype too)
   for col, typ in zip(columns_df1, data_types_df1):
      if col not in df2.columns:
         df2 = df2\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df2 has all missing columns from df1, let's do the same for df1
   for col, typ in zip(columns_df2, data_types_df2):
      if col not in df1.columns:
         df1 = df1\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df1 and df2 have the same columns, not necessarily in the same 
   # order, therefore we use unionByName
   combined_df = df1\
      .unionByName(df2)
 
   return combined_df

Re: How to do a unionAll() when the number and the name of columns are different?

sean_owen — Fri, 18 Jun 2021 17:11:30 GMT

I'm not sure union is the right tool, if the DataFrames have fundamentally different information in them. If the difference is merely column name, yes, rename. If they don't, then the 'union' contemplated here is really a union of columns as well as rows. Yeah you can always produce a DataFrame with all rows from both, with all cols from both, with nulls where one DataFrame doesn't have the other's column(s). That could make sense but I'd also wonder whether a join() is what's intended in some cases like this.

Re: How to do a unionAll() when the number and the name of columns are different?

User15787040559 — Fri, 18 Jun 2021 17:25:32 GMT

The objective here is to append the rows of Dataframe 2 to the rows of Dataframe 1.