Re: Spark SQL Group by duplicates, collect_list in...

Hubert-Dudek · ‎02-01-2022

In my opinion you are heading in the right direction: grouping with collect_list (or, more generally, collecting into an array or map) is the way to go.
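As a rough sketch of that grouping step, something like the following should work. The DataFrame `events` and its columns `key` and `value` are made-up names for illustration, not taken from your data, and the local SparkSession is only there so the snippet runs on its own (in a Databricks notebook `spark` already exists and is reused):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, size}

// getOrCreate() reuses the notebook session if one is already running.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-list-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: two rows share the key "a".
val events = Seq(("a", "x"), ("a", "y"), ("b", "z")).toDF("key", "value")

// One row per key, with every value for that key gathered into an array column.
val grouped = events
  .groupBy("key")
  .agg(collect_list("value").as("values"))

// Keys that occur more than once have more than one collected value.
val duplicates = grouped.filter(size(col("values")) > 1)
duplicates.show(false)
```

Keep in mind that the order of elements produced by collect_list is not guaranteed after a shuffle, so any comparison logic applied to the arrays should not rely on element order.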

You need to write a function that compares those elements and register it as a user-defined function (UDF). You can even pass multiple array columns to the function and return whatever you need. The function can handle any logic with plain if/else.

Here is example code from the internet that compares two arrays. You can find many more examples by searching for "spark udfs":

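import org.apache.spark.sql.functions.{col, udf}

// Returns 1 when a contains every element of b (duplicates counted), otherwise 0.
// Seq[String] accepts Spark array columns on both Scala 2.12 and 2.13.
val same_elements = udf { (a: Seq[String], b: Seq[String]) =>
  if (a.intersect(b).length == b.length) 1 else 0
}

// df is assumed to have two array<string> columns named array1 and array2.
df.withColumn("test", same_elements(col("array1"), col("array2")))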
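Since the question is about Spark SQL, the same idea can also be registered by name and called from a query. This is only a sketch under assumptions: the UDF name `contains_all`, the temp view `pairs`, and its columns are hypothetical, and the comparison is a simple order-independent containment check rather than your exact rule:

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses the notebook session if one is already running.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("udf-sql-sketch")
  .getOrCreate()
import spark.implicits._

// Register the function under a (hypothetical) name so SQL can call it.
// Returns 1 when every element of b is present in a, otherwise 0.
spark.udf.register("contains_all", (a: Seq[String], b: Seq[String]) =>
  if (a != null && b != null && b.forall(a.contains)) 1 else 0)

// Hypothetical sample data exposed as a temp view.
Seq(("k1", "x", "x"), ("k1", "y", "y"), ("k2", "z", "w"))
  .toDF("key", "left_value", "right_value")
  .createOrReplaceTempView("pairs")

// Collect both columns per key and compare the resulting arrays in one query.
val result = spark.sql("""
  SELECT key,
         contains_all(collect_list(left_value), collect_list(right_value)) AS test
  FROM pairs
  GROUP BY key
""")
result.show()
```

Swap the body of the lambda for whatever if/else logic you need; the registration and the query stay the same.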

My blog: https://databrickster.medium.com/