Databricks Community

high-energy · 06-01-2024

I have three data frames that I create in Python, and I want to write all three to the same Delta table. In code, I combine them using the union operation.

When I do this the data in the columns is no longer aligned correctly.

I can bring two of the data frames together successfully. Adding the third data frame causes the misalignment.

I've verified that all of the columns are identically named.

What else should I be looking at? Is there a simpler approach to achieving this result?

Thanks,

Shawn

sreeyv · 06-01-2024

Check the data types of the columns: are they all the same? Also try the union with a subset of the third table, maybe 2 or 3 rows, using a LIMIT clause. That confirms it works for at least a few records. If it does, increase the LIMIT; there may be a single row with bad data.

high-energy · 06-02-2024

No, the data types are not consistent. For example, a column that contains integers is a double in one data frame but an integer in another.
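A quick way to spot this kind of inconsistency is to compare the `dtypes` of each data frame before the union. The sketch below is only an illustration: `schema_mismatches` is a hypothetical helper name, and the literal lists stand in for what `df.dtypes` returns on a PySpark DataFrame (a list of `(column_name, type_string)` pairs), so it runs without a Spark session.

```python
def schema_mismatches(reference, other):
    """Compare two schemas given as lists of (column_name, type_string) pairs,
    the shape returned by a PySpark DataFrame's `dtypes` attribute.

    Returns a list of readable differences; an empty list means the schemas
    agree in column names, types, and order.
    """
    problems = []
    ref_types = dict(reference)
    other_types = dict(other)

    # Columns that exist on one side only.
    for name, _ in reference:
        if name not in other_types:
            problems.append(f"missing column: {name}")
    for name, _ in other:
        if name not in ref_types:
            problems.append(f"unexpected column: {name}")

    # Columns with the same name but a different type.
    for name, ref_type in reference:
        if name in other_types and other_types[name] != ref_type:
            problems.append(
                f"type mismatch on {name}: {ref_type} vs {other_types[name]}"
            )

    # Same set of names, different positions.
    ref_order = [name for name, _ in reference]
    other_order = [name for name, _ in other]
    if ref_order != other_order and sorted(ref_order) == sorted(other_order):
        problems.append(f"column order differs: {ref_order} vs {other_order}")

    return problems


# Illustrative schemas (assumed values, standing in for df1.dtypes and df3.dtypes).
df1_dtypes = [("Year", "int"), ("First_Name", "string"), ("County", "string"),
              ("Sex", "string"), ("Count", "int")]
df3_dtypes = [("Sex", "string"), ("Year", "double"), ("First_Name", "string"),
              ("County", "string"), ("Count", "int")]

for problem in schema_mismatches(df1_dtypes, df3_dtypes):
    print(problem)
```

In a notebook, the same check would be called as `schema_mismatches(df1.dtypes, df3.dtypes)` for each pair of data frames.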

high-energy · 06-08-2024

Aligning the data types and column order across all three data frames before the union solved the problem. union matches columns by position, not by name, so identical column names in a different order still end up misaligned. The snippet below reproduces what was happening.

# df1 and df2 have the same column names, but in a different order.
data = [[2021, "test", "Albany", "M", 42]]
df1 = spark.createDataFrame(data, schema="Year int, First_Name STRING, County STRING, Sex STRING, Count int")

data2 = [["M", 2021, "test", "Albany", 42]]
df2 = spark.createDataFrame(data2, schema="Sex STRING, Year int, First_Name STRING, County STRING, Count int")

# union() pairs columns by position, so "M" from df2 lands under Year.
df3 = df1.union(df2)
display(df3)
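One way to do that alignment is sketched below, assuming every data frame already contains all of the target columns. `align_to_schema` and `union_aligned` are hypothetical helper names, not part of PySpark; they rely only on the DataFrame methods `select` and `unionByName` and on `Column.cast`, and `target_schema` is a list of `(column_name, type_string)` pairs such as `df1.dtypes`.

```python
from functools import reduce


def align_to_schema(df, target_schema):
    """Return df with the columns of target_schema, in that order,
    each cast to the target type.

    target_schema is a list of (column_name, type_string) pairs, for example
    the `dtypes` of the data frame chosen as the reference.
    """
    return df.select(
        [df[name].cast(dtype).alias(name) for name, dtype in target_schema]
    )


def union_aligned(frames, target_schema):
    """Align every data frame to target_schema, then union them by column name."""
    aligned = [align_to_schema(df, target_schema) for df in frames]
    return reduce(lambda left, right: left.unionByName(right), aligned)


# Usage with the frames from the snippet above (needs a live SparkSession,
# for example a Databricks notebook):
# df3 = union_aligned([df1, df2], df1.dtypes)
# display(df3)
```

`unionByName` resolves columns by name rather than position, so it would fix the ordering problem on its own. The explicit cast is there so that a column stored as a double in one data frame and an integer in another ends up with one agreed type, rather than whatever type Spark picks when it reconciles the two.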

Union and Column data types
