<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to do a unionAll() when the number and the name of columns are different? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22734#M15625</link>
    <description>&lt;P&gt;Looking at the API for &lt;A href="http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionAll.html?highlight=unionall" alt="http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionAll.html?highlight=unionall" target="_blank"&gt;Dataframe.unionAll()&lt;/A&gt; when you have 2 different dataframes with different number of columns and names unionAll() doesn't work.&lt;/P&gt;&lt;P&gt;How can you do it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One possible solution is using the following function which performs the union of two dataframes with different schemas and returns a combined dataframe:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.sql.functions as F
    
def union_different_schemas(df1, df2):
   # Get a list of all column names in both dfs
   columns_df1 = df1.columns
   columns_df2 = df2.columns
   # Get a list of datatypes of the columns
   data_types_df1 = [i.dataType for i in df1.schema.fields]
   data_types_df2 = [i.dataType for i in df2.schema.fields]
   # We go through all columns in df1 and if they are not in df2, we add 
   # them (and specify the correct datatype too)
   for col, typ in zip(columns_df1, data_types_df1):
      if col not in df2.columns:
         df2 = df2\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df2 has all missing columns from df1, let's do the same for df1
   for col, typ in zip(columns_df2, data_types_df2):
      if col not in df1.columns:
         df1 = df1\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df1 and df2 have the same columns, not necessarily in the same 
   # order, therefore we use unionByName
   combined_df = df1\
      .unionByName(df2)
&amp;nbsp;
   return combined_df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 18 Jun 2021 16:42:06 GMT</pubDate>
    <dc:creator>User15787040559</dc:creator>
    <dc:date>2021-06-18T16:42:06Z</dc:date>
    <item>
      <title>How to do a unionAll() when the number and the name of columns are different?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22734#M15625</link>
      <description>&lt;P&gt;Looking at the API for &lt;A href="http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionAll.html?highlight=unionall" alt="http://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionAll.html?highlight=unionall" target="_blank"&gt;Dataframe.unionAll()&lt;/A&gt; when you have 2 different dataframes with different number of columns and names unionAll() doesn't work.&lt;/P&gt;&lt;P&gt;How can you do it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One possible solution is using the following function which performs the union of two dataframes with different schemas and returns a combined dataframe:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pyspark.sql.functions as F
    
def union_different_schemas(df1, df2):
   # Get a list of all column names in both dfs
   columns_df1 = df1.columns
   columns_df2 = df2.columns
   # Get a list of datatypes of the columns
   data_types_df1 = [i.dataType for i in df1.schema.fields]
   data_types_df2 = [i.dataType for i in df2.schema.fields]
   # We go through all columns in df1 and if they are not in df2, we add 
   # them (and specify the correct datatype too)
   for col, typ in zip(columns_df1, data_types_df1):
      if col not in df2.columns:
         df2 = df2\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df2 has all missing columns from df1, let's do the same for df1
   for col, typ in zip(columns_df2, data_types_df2):
      if col not in df1.columns:
         df1 = df1\
            .withColumn(col, F.lit(None).cast(typ))
   # Now df1 and df2 have the same columns, not necessarily in the same 
   # order, therefore we use unionByName
   combined_df = df1\
      .unionByName(df2)
&amp;nbsp;
   return combined_df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jun 2021 16:42:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22734#M15625</guid>
      <dc:creator>User15787040559</dc:creator>
      <dc:date>2021-06-18T16:42:06Z</dc:date>
    </item>
    <item>
      <title>Re: How to do a unionAll() when the number and the name of columns are different?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22735#M15626</link>
      <description>&lt;P&gt;I'm not sure union is the right tool, if the DataFrames have fundamentally different information in them. If the difference is merely column name, yes, rename. If they don't, then the 'union' contemplated here is really a union of columns as well as rows. Yeah you can always produce a DataFrame with all rows from both, with all cols from both, with nulls where one DataFrame doesn't have the other's column(s). That could make sense but I'd also wonder whether a join() is what's intended in some cases like this.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jun 2021 17:11:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22735#M15626</guid>
      <dc:creator>sean_owen</dc:creator>
      <dc:date>2021-06-18T17:11:30Z</dc:date>
    </item>
    <item>
      <title>Re: How to do a unionAll() when the number and the name of columns are different?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22736#M15627</link>
      <description>&lt;P&gt;The objective here is to append the rows of Dataframe 2 to the rows of Dataframe 1.&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jun 2021 17:25:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-do-a-unionall-when-the-number-and-the-name-of-columns-are/m-p/22736#M15627</guid>
      <dc:creator>User15787040559</dc:creator>
      <dc:date>2021-06-18T17:25:32Z</dc:date>
    </item>
  </channel>
</rss>

