cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Split a row into multiple rows based on a column value in Spark SQL

rishigc
New Contributor

Hi,

I am trying to split a record in a table to 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding column has value. The other columns have Null.

My requirement is - whenever the Product column value (in a row) is composite (i.e. has more than one product, e.g. Bolt + Brush), the record must be split into two rows - 1 row each for the composite product types. So, in this example, notice how the 2nd row gets split into 2 rows -> 1 row for "Bolt" and another for the "Brush", with their Price extracted from their corresponding columns (i.e in this case, "Bolt" = $3.99 and "Brush" = $6.99)

Note: For composite product values there can be at most 2 products as shown in this example (e.g. Bolt + Brush)

Input:

CustIdProductHammerBoltBrush1234Hammer$5.99 Null Null 7639 Bolt + Brush Null $3.99 $6.996322Brush Null NULL

$6.99

Required Output:

CustIdProductPrice1234Hammer$5.99 7639 Bolts $3.99 7639 Brush $6.996322Brush$6.99

Can anyone kindly help me solve the same. This has to be solved by Spark-SQL only.

Regards

1 REPLY 1

mathan_pillai
Valued Contributor
Valued Contributor

Hi @rishigc

You can use something like below.

SELECT explode(arrays_zip(split(Product, '+'), split(Price, '+') ) as product_and_price from df

or

df.withColumn("product_and_price", explode(arrays_zip(split(Product, '+'), split(Price, '+'))).select(
  $"CustId", $"prodAndPrice.Product", $"prodAndPrice.Price").show()

Here df is the dataframe.

split
function splits the column into array of products & array of prices. These 2 arrays will be merged by
arrays_zip
, so that Nth product will be mapped to Nth price. Then the merged array is exploded using
explode
, so that each element in the array becomes a separate row.

please let us know if it works.

Thanks

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.