topic Re: Update target column with optional source columns in Data Engineering

Update target column with optional source columns

RUC — Tue, 18 Feb 2025 21:33:47 GMT

I am trying to load on prem transaction log table and update a databricks table.In Source table,the column reqdetails table hold all the information and expect ProductID remaining columns are dynamic(not all columns exists in the request) all time the optional/nullable.Could you please let me the process using python in databricks ? Explored StructType but it needs mandatory specific columns.

Source Table

Req ID	Type	Req Details	Status
1	Update	ProductID=234;ProductName=LawnMover;Price=58	True
2	Update	ProductID=874;Price=478	True
3	Update	ProductID=678;ProductParentgroup=Watersuppuly;Price=1.6	True

Target table before Update

ProductID	ProductParentgroup	ProductName	Price
234	Utility	Mover	86
874	HOA	Sink	450
678	Water	Filters	1.2

Target table after Update

ProductID	ProductParentgroup	ProductName	Price
234	Utility	LawnMover	58
874	HOA	Sink	478
678	Watersupply	Filters	1.6

Re: Update target column with optional source columns

Brahmareddy — Mon, 24 Feb 2025 20:21:14 GMT

Hi Ruc

How are you doing today?, as per my understanding, Since your source table has a ReqDetails column with dynamic key-value pairs and only ProductID is fixed, you can parse the ReqDetails column dynamically and merge it into the target table using PySpark and Delta Lake. First, read the source data into a DataFrame and use a UDF (User-Defined Function) to split ReqDetails into a dictionary of key-value pairs. Then, extract only the available keys (e.g., ProductName, Price, ProductParentgroup) and create new columns dynamically. Finally, use Delta Lake’s MERGE operation to update only the modified values in the target table, using COALESCE() to retain existing data when a column is missing in the request. This way, you avoid a rigid schema while ensuring that updates happen correctly. Hope this helps!

Regards,

Brahma