cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Split a row into multiple rows based on a column value in Spark SQL

rishigc
New Contributor

Hi,

I am trying to split a record in a table to 2 records based on a column value. Please refer to the sample below. The input table displays the 3 types of Product and their price. Notice that for a specific Product (row) only its corresponding column has value. The other columns have Null.

My requirement is - whenever the Product column value (in a row) is composite (i.e. has more than one product, e.g. Bolt + Brush), the record must be split into two rows - 1 row each for the composite product types. So, in this example, notice how the 2nd row gets split into 2 rows -> 1 row for "Bolt" and another for the "Brush", with their Price extracted from their corresponding columns (i.e in this case, "Bolt" = $3.99 and "Brush" = $6.99)

Note: For composite product values there can be at most 2 products as shown in this example (e.g. Bolt + Brush)

Input:

CustIdProductHammerBoltBrush1234Hammer$5.99 Null Null 7639 Bolt + Brush Null $3.99 $6.996322Brush Null NULL

$6.99

Required Output:

CustIdProductPrice1234Hammer$5.99 7639 Bolts $3.99 7639 Brush $6.996322Brush$6.99

Can anyone kindly help me solve the same. This has to be solved by Spark-SQL only.

Regards

1 REPLY 1

mathan_pillai
Databricks Employee
Databricks Employee

Hi @rishigc

You can use something like below.

SELECT explode(arrays_zip(split(Product, '+'), split(Price, '+') ) as product_and_price from df

or

df.withColumn("product_and_price", explode(arrays_zip(split(Product, '+'), split(Price, '+'))).select(
  $"CustId", $"prodAndPrice.Product", $"prodAndPrice.Price").show()

Here df is the dataframe.

split
function splits the column into array of products & array of prices. These 2 arrays will be merged by
arrays_zip
, so that Nth product will be mapped to Nth price. Then the merged array is exploded using
explode
, so that each element in the array becomes a separate row.

please let us know if it works.

Thanks

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group