
Union and Column data types

high-energy
New Contributor III

I have three data frames that I create in Python. I want to write all three of them to the same Delta table. In code, I combine them using the union operation.

When I do this the data in the columns is no longer aligned correctly.

I can bring two of the data frames together successfully. Adding the third data frame causes the misalignment.

I've verified that all of the columns are identically named.

What else should I be looking at? Is there a simpler approach to achieving this result?

Thanks,

Shawn

 

 

1 ACCEPTED SOLUTION

Accepted Solutions

high-energy
New Contributor III

Aligning the data types and column order across all three data frames before unioning them solved the problem. The snippet below reproduces what was happening.

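# Both frames use the same column names, but in a different order.
data = [[2021, "test", "Albany", "M", 42]]
df1 = spark.createDataFrame(data, schema="Year int, First_Name STRING, County STRING, Sex STRING, Count int")

data2 = [["M", 2021, "test", "Albany", 42]]
df2 = spark.createDataFrame(data2, schema="Sex STRING, Year int, First_Name STRING, County STRING, Count int")

# union() pairs columns by position, not by name, so df2's "M" lands
# under Year and its 2021 lands under First_Name.
df3 = df1.union(df2)
display(df3)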
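
A possibly simpler route than reordering columns by hand is PySpark's `DataFrame.unionByName`, which matches columns by name rather than by position. The sketch below folds any number of frames together that way. `union_all_by_name` is a hypothetical helper name, and the sketch assumes every frame has the same set of column names.

```python
from functools import reduce


def union_all_by_name(frames):
    """Union PySpark DataFrames by column name instead of by position.

    `union_all_by_name` is a hypothetical helper, not part of PySpark.
    Assumes every frame exposes the same column names; by default
    unionByName raises an error if a column is missing from a frame.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame")
    # DataFrame.unionByName resolves columns by name, so the column
    # order of the individual frames no longer matters.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

With the frames from the snippet above, `union_all_by_name([df1, df2])` should keep `2021` under `Year` and `"M"` under `Sex`. This only addresses column order; data types are still worth aligning separately.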


3 REPLIES 3

sreeyv
New Contributor II

Check the data types of the columns: are they all the same? Try the union with a small subset of the third table, say 2 or 3 rows using a LIMIT clause, to confirm it works for at least a few records. If it does, increase the LIMIT; there may be a single row with bad data.
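
One way to run this check without eyeballing the schemas is to compare the `(name, type)` pairs that PySpark exposes as `DataFrame.dtypes`. The helper below is a hypothetical sketch that works on plain lists of such pairs, so it can be tried without a cluster. The name `schema_differences` and its message format are assumptions made for illustration.

```python
def schema_differences(reference, other):
    """Compare two schemas given as lists of (column_name, type_string)
    pairs, the shape returned by PySpark's DataFrame.dtypes.

    Returns a list of human-readable differences; an empty list means
    the schemas agree on names, order, and types.
    """
    problems = []
    ref_types = dict(reference)
    other_types = dict(other)

    # Columns that exist on one side only.
    for name, _ in reference:
        if name not in other_types:
            problems.append(f"{name}: missing from other")
    for name, _ in other:
        if name not in ref_types:
            problems.append(f"{name}: missing from reference")

    # Same name, different type (for example int vs double).
    for name, ref_type in reference:
        if name in other_types and other_types[name] != ref_type:
            problems.append(f"{name}: {ref_type} vs {other_types[name]}")

    # Same names in a different position: this is what breaks a
    # positional union().
    ref_order = [name for name, _ in reference]
    other_order = [name for name, _ in other]
    if ref_order != other_order and sorted(ref_order) == sorted(other_order):
        problems.append(f"column order differs: {ref_order} vs {other_order}")
    return problems


# Schemas written out by hand to mirror df1 and df2 from this thread.
df1_dtypes = [("Year", "int"), ("First_Name", "string"),
              ("County", "string"), ("Sex", "string"), ("Count", "int")]
df2_dtypes = [("Sex", "string"), ("Year", "int"),
              ("First_Name", "string"), ("County", "string"), ("Count", "int")]

for line in schema_differences(df1_dtypes, df2_dtypes):
    print(line)
```

In a notebook the inputs would be `df1.dtypes` and `df2.dtypes` rather than hand-written lists.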

high-energy
New Contributor III

No, the data types are not consistent. For example, a column that contains integers is a double in one data frame but an integer in another.
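
A mismatch like this can be handled by reordering and casting each frame to one agreed layout before the union. The sketch below is one way to do that, assuming PySpark DataFrames. `align_to_schema` is a hypothetical helper name, and the target layout is passed as `(name, type)` pairs such as another frame's `dtypes`.

```python
def align_to_schema(df, target):
    """Reorder and cast the columns of a PySpark DataFrame.

    `target` is a list of (column_name, type_string) pairs, for example
    the `dtypes` of the frame whose layout the others should follow.
    `align_to_schema` is a hypothetical helper, not a PySpark API.
    """
    # df[name] looks the column up by name, cast() accepts a type
    # string such as "int" or "double", and alias() makes sure the
    # column keeps its name after the cast.
    columns = [df[name].cast(dtype).alias(name) for name, dtype in target]
    # Selecting in target order fixes the column order as well.
    return df.select(*columns)
```

For example, `align_to_schema(other_df, df1.dtypes)` would give a hypothetical `other_df` the same column order and types as `df1`, assuming each value can be cast safely (a double that only holds whole numbers, for instance).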
