topic Re: Union and Column data types in Data Engineering

Union and Column data types

high-energy — Sat, 01 Jun 2024 17:36:12 GMT

I have three data frames that I create in Python. I want to write all three of them to the same Delta table. In code, I bring the three of them together using the union operation.

When I do this the data in the columns is no longer aligned correctly.

I can bring two of the data frames together successfully. Adding the third data frame causes the misalignment.

I've verified that all of the columns are identically named.

What else should I be looking at? Is there a simpler approach to achieving this result?

Thanks,

Shawn

Re: Union and Column data types

sreeyv — Sat, 01 Jun 2024 19:09:47 GMT

Check the data types of the columns: are they all the same? Try the union with a subset of the third table, maybe 2 or 3 rows using a LIMIT clause, to confirm it works for at least a few records. If that works, increase the LIMIT; there may be a single row with bad data.
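One way to run the data type check above is to compare the `dtypes` of each data frame side by side. The sketch below is only an illustration: `schema_mismatches` is a hypothetical helper name, and the sample values are made up in the `(column, type)` shape that `DataFrame.dtypes` returns.

```python
def schema_mismatches(dtypes_by_frame):
    """Map each column whose type differs (or is missing) across frames
    to a {frame_name: type} dict."""
    seen = {}
    for frame, dtypes in dtypes_by_frame.items():
        for column, dtype in dtypes:
            seen.setdefault(column, {})[frame] = dtype
    return {
        column: types
        for column, types in seen.items()
        # Flag columns with more than one type, or absent from some frame.
        if len(set(types.values())) > 1 or len(types) < len(dtypes_by_frame)
    }


# Hypothetical sample input, shaped like DataFrame.dtypes output.
example = {
    "df1": [("Year", "int"), ("Count", "int")],
    "df2": [("Year", "int"), ("Count", "double")],
}
print(schema_mismatches(example))
# prints {'Count': {'df1': 'int', 'df2': 'double'}}
```

In a notebook this could be called as `schema_mismatches({"df1": df1.dtypes, "df2": df2.dtypes, "df3": df3.dtypes})`. Note that it compares types by column name only; it does not report differences in column order.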

Re: Union and Column data types

high-energy — Sun, 02 Jun 2024 12:40:02 GMT

No, the data types are not consistent. For example, a column that contains integers is a double in one data frame but an integer in another.

Re: Union and Column data types

high-energy — Sat, 08 Jun 2024 12:43:35 GMT

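Aligning the data types and column order across all three data frames before attempting to union them solved the problem. union matches columns by position, not by name, so identically named columns in a different order end up misaligned. The snippet below shows what was happening.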

    # df1 and df2 have the same column names, but in a different order.
    data = [[2021, "test", "Albany", "M", 42]]
    df1 = spark.createDataFrame(data, schema="Year int, First_Name STRING, County STRING, Sex STRING, Count int")

    data2 = [["M", 2021, "test", "Albany", 42]]
    df2 = spark.createDataFrame(data2, schema="Sex STRING, Year int, First_Name STRING, County STRING, Count int")

    # union pairs columns by position, so df2's Sex value lands under Year, and so on.
    df3 = df1.union(df2)
    display(df3)
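For anyone hitting the same issue, here is a minimal sketch of one way to do the alignment, not necessarily the exact fix used above. It casts every data frame to one target schema with `selectExpr` and then combines them with `unionByName`, which matches columns by name. `TARGET_SCHEMA`, `cast_exprs`, `align`, and `union_aligned` are hypothetical names chosen for illustration, and the target types are assumed from the snippet.

```python
from functools import reduce

# Assumed target schema: column name -> Spark SQL type, in the desired order.
TARGET_SCHEMA = {
    "Year": "int",
    "First_Name": "string",
    "County": "string",
    "Sex": "string",
    "Count": "int",
}


def cast_exprs(schema):
    """Build one SQL CAST expression per column, in target order."""
    return [f"CAST(`{name}` AS {dtype}) AS `{name}`" for name, dtype in schema.items()]


def align(df, schema=TARGET_SCHEMA):
    """Reorder and cast a Spark DataFrame's columns to the target schema."""
    return df.selectExpr(*cast_exprs(schema))


def union_aligned(dfs, schema=TARGET_SCHEMA):
    """Align every DataFrame, then union them by column name."""
    aligned = [align(df, schema) for df in dfs]
    return reduce(lambda left, right: left.unionByName(right), aligned)
```

With the data frames from the snippet, `union_aligned([df1, df2])` should keep each value under its own column. Once every frame has been through `align`, the column order matches too, so a plain `union` would also line up; `unionByName` is just an extra safeguard.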