<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Union and Column data types in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71329#M34292</link>
    <description>&lt;P&gt;Check the data types of the columns: are they all the same? Also try the union with a small subset of the third table, say 2 or 3 rows using a LIMIT clause, to confirm it works for at least a few records. If that works, increase the LIMIT; there may be a single row with bad data.&lt;/P&gt;</description>
    <pubDate>Sat, 01 Jun 2024 19:09:47 GMT</pubDate>
    <dc:creator>sreeyv</dc:creator>
    <dc:date>2024-06-01T19:09:47Z</dc:date>
    <item>
      <title>Union and Column data types</title>
      <link>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71325#M34289</link>
      <description>&lt;P&gt;I have three data frames that I create in Python. I want to write all three of them to the same Delta table. In code, I combine the three using the union operation.&lt;/P&gt;&lt;P&gt;When I do this, the data in the columns is no longer aligned correctly.&lt;/P&gt;&lt;P&gt;I can combine two of the data frames successfully. Adding the third data frame causes the misalignment.&lt;/P&gt;&lt;P&gt;I've verified that all of the columns are identically named.&lt;/P&gt;&lt;P&gt;What else should I be looking at? Is there a simpler approach to achieving this result?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Shawn&lt;/P&gt;</description>
      <pubDate>Sat, 01 Jun 2024 17:36:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71325#M34289</guid>
      <dc:creator>high-energy</dc:creator>
      <dc:date>2024-06-01T17:36:12Z</dc:date>
    </item>
    <item>
      <title>Re: Union and Column data types</title>
      <link>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71329#M34292</link>
      <description>&lt;P&gt;Check the data types of the columns: are they all the same? Also try the union with a small subset of the third table, say 2 or 3 rows using a LIMIT clause, to confirm it works for at least a few records. If that works, increase the LIMIT; there may be a single row with bad data.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Jun 2024 19:09:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71329#M34292</guid>
      <dc:creator>sreeyv</dc:creator>
      <dc:date>2024-06-01T19:09:47Z</dc:date>
    </item>
    <item>
      <title>Re: Union and Column data types</title>
      <link>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71361#M34296</link>
      <description>&lt;P&gt;No, the data types are not consistent. For example, a column that contains integers is a double in one data frame but an integer in another.&lt;/P&gt;</description>
      <pubDate>Sun, 02 Jun 2024 12:40:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/71361#M34296</guid>
      <dc:creator>high-energy</dc:creator>
      <dc:date>2024-06-02T12:40:02Z</dc:date>
    </item>
    <item>
      <title>Re: Union and Column data types</title>
      <link>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/72123#M34499</link>
      <description>&lt;P&gt;Aligning the data types and column order across all three data frames before the union solved the problem. The snippet below reproduces what was happening: union() matches columns by position, not by name, so the values from df2 land under the column names of df1.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;data = [[2021, "test", "Albany", "M", 42]]

df1 = spark.createDataFrame(data, schema="Year int, First_Name STRING, County STRING, Sex STRING, Count int")

# Same columns as df1, but Sex is listed first instead of fourth
data2 = [["M", 2021, "test", "Albany", 42]]

df2 = spark.createDataFrame(data2, schema="Sex STRING, Year int, First_Name STRING, County STRING, Count int")

df3 = df1.union(df2)
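# df3 above is the misaligned result: "M" from df2 ends up under Year.
#
# A minimal sketch of one possible fix, under assumptions (this is not the
# code from the original job): DataFrame.unionByName() matches columns by
# name instead of by position. union_by_name is a hypothetical helper name
# introduced here for illustration only.
# Types are a separate concern: cast columns explicitly where the frames
# disagree (for example int vs double) before combining them.
def union_by_name(first, *others):
    """Combine DataFrames one at a time, matching columns by name."""
    result = first
    for other in others:
        result = result.unionByName(other)
    return result

# Possible usage with the frames defined above:
# display(union_by_name(df1, df2))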

display(df3)&lt;/LI-CODE&gt;</description>
      <pubDate>Sat, 08 Jun 2024 12:43:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/union-and-column-data-types/m-p/72123#M34499</guid>
      <dc:creator>high-energy</dc:creator>
      <dc:date>2024-06-08T12:43:35Z</dc:date>
    </item>
  </channel>
</rss>

