Split a row into multiple rows based on a column value in Spark SQL

rishigc — Thu, 25 Apr 2019 16:43:45 GMT

Hi,

I am trying to split a record in a table to 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding column has value. The other columns have Null.

My requirement is - whenever the Product column value (in a row) is composite (i.e. has more than one product, e.g. Bolt + Brush), the record must be split into two rows - 1 row each for the composite product types. So, in this example, notice how the 2nd row gets split into 2 rows -> 1 row for "Bolt" and another for the "Brush", with their Price extracted from their corresponding columns (i.e in this case, "Bolt" = $3.99 and "Brush" = $6.99)

Note: For composite product values there can be at most 2 products as shown in this example (e.g. Bolt + Brush)

Input:

CustIdProductHammerBoltBrush1234Hammer$5.99 Null Null 7639 Bolt + Brush Null $3.99 $6.996322Brush Null NULL

$6.99

Required Output:

CustIdProductPrice1234Hammer$5.99 7639 Bolts $3.99 7639 Brush $6.996322Brush$6.99

Can anyone kindly help me solve the same. This has to be solved by Spark-SQL only.

Regards

Re: Split a row into multiple rows based on a column value in Spark SQL

mathan_pillai — Fri, 26 Apr 2019 10:31:30 GMT

Hi @rishigc

You can use something like below.

SELECT explode(arrays_zip(split(Product, '+'), split(Price, '+') ) as product_and_price from df

df.withColumn("product_and_price", explode(arrays_zip(split(Product, '+'), split(Price, '+'))).select(
  $"CustId", $"prodAndPrice.Product", $"prodAndPrice.Price").show()

Here df is the dataframe.

split

function splits the column into array of products & array of prices. These 2 arrays will be merged by

arrays_zip

, so that Nth product will be mapped to Nth price. Then the merged array is exploded using

explode

, so that each element in the array becomes a separate row.

please let us know if it works.

Thanks

topic Split a row into multiple rows based on a column value in Spark SQL in Data Engineering

Split a row into multiple rows based on a column value in Spark SQL

Re: Split a row into multiple rows based on a column value in Spark SQL