Data Engineering

Union and Column data types

high-energy
New Contributor III

I have three data frames that I create in Python. I want to write all three of them to the same Delta table. In code, I bring the three of them together using the union operation.

When I do this the data in the columns is no longer aligned correctly.

I can bring two of the data frames together successfully. Adding the third data frame causes the misalignment.

I've verified that all of the columns are identically named.

What else should I be looking at? Is there a simpler approach to achieving this result?

Thanks,

Shawn

1 ACCEPTED SOLUTION

Accepted Solutions

high-energy
New Contributor III

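Aligning the data types and column order across all three data frames before the union solved the problem. union matches columns by position rather than by name, so identically named columns that appear in a different order end up misaligned. The snippet below reproduces what was happening.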

# Same column names in both data frames, but in a different order.
data = [[2021, "test", "Albany", "M", 42]]
df1 = spark.createDataFrame(data, schema="Year int, First_Name STRING, County STRING, Sex STRING, Count int")

data2 = [["M", 2021, "test", "Albany", 42]]
df2 = spark.createDataFrame(data2, schema="Sex STRING, Year int, First_Name STRING, County STRING, Count int")

# union() pairs columns by position, so df2's Sex values land under df1's Year column.
df3 = df1.union(df2)
display(df3)
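For anyone hitting the same issue, here is a minimal sketch of one way to do the alignment described above, assuming PySpark's `DataFrame.selectExpr` and `DataFrame.unionByName`. The helper names `cast_exprs`, `align`, and `union_aligned`, and the `TARGET_SCHEMA` list, are hypothetical and are not taken from the original post.

```python
from functools import reduce

# Hypothetical target schema: (column name, Spark SQL type) in the desired order.
TARGET_SCHEMA = [
    ("Year", "int"),
    ("First_Name", "string"),
    ("County", "string"),
    ("Sex", "string"),
    ("Count", "int"),
]


def cast_exprs(schema):
    """Build one SQL CAST expression per column, in target order."""
    return [f"CAST(`{name}` AS {dtype}) AS `{name}`" for name, dtype in schema]


def align(df, schema=TARGET_SCHEMA):
    """Select columns by name in target order and cast each to its target type."""
    return df.selectExpr(*cast_exprs(schema))


def union_aligned(frames, schema=TARGET_SCHEMA):
    """Align every data frame, then union them by column name."""
    aligned = [align(df, schema) for df in frames]
    return reduce(lambda left, right: left.unionByName(right), aligned)
```

With the data frames from the snippet above, `union_aligned([df1, df2])` selects each column by name before the union, so the result no longer depends on the column order of the inputs.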


3 REPLIES

sreeyv
New Contributor II

Check the data types of the columns: are they all the same? Also try the union with a subset of the third table, maybe 2 or 3 rows selected with a LIMIT clause, to confirm it works for at least a few records. If that works, increase the LIMIT; there may be a single row with bad data.
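One possible way to run that check is a small pure-Python helper that compares two schemas position by position. This is only a sketch: the function name `schema_mismatches` and the sample lists are hypothetical, and the input is assumed to have the shape of PySpark's `DataFrame.dtypes`, a list of (column name, type string) pairs.

```python
def schema_mismatches(dtypes_a, dtypes_b):
    """Compare two lists of (column name, type) pairs position by position."""
    problems = []
    if len(dtypes_a) != len(dtypes_b):
        problems.append(f"column count differs: {len(dtypes_a)} vs {len(dtypes_b)}")
    pairs = zip(dtypes_a, dtypes_b)
    for pos, ((name_a, type_a), (name_b, type_b)) in enumerate(pairs):
        if name_a != name_b:
            problems.append(f"position {pos}: name {name_a!r} vs {name_b!r}")
        if type_a != type_b:
            problems.append(f"position {pos}: type {type_a!r} vs {type_b!r}")
    return problems


# Made-up sample input in the shape of DataFrame.dtypes.
first = [("Year", "int"), ("Count", "int")]
third = [("Year", "int"), ("Count", "double")]
print(schema_mismatches(first, third))
```

In a notebook, the same function could be called as `schema_mismatches(df1.dtypes, df3.dtypes)`, where `df3` stands for the third data frame; an empty list means names and types match in every position.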

high-energy
New Contributor III

No, the data types are not consistent. For example, a column that contains integers is a double in one data frame but an integer in another.

