PohlPosition
Databricks Employee

If I understand you correctly, you have a large array of tokens, and you want to filter that large array against a smaller array of tokens.

You can convert both arrays into RDDs and then use the intersection() function to return only the tokens that the two RDDs have in common:

val listofECtokens: Array[String] = Array("EC-17A5206955089011B", "EC-17A5206955089011A")

//Turn this array into an RDD

val listofECtokensRDD = sc.parallelize(listofECtokens)

//Create a bigger RDD of tokens

val biggerListofECtokensRDD = sc.parallelize(Array("EC-17A5206955089011B", "EC-17A5206955089011A", "EC-15B5206955089011A", "EC-12C5206955089011A"))

//Collect just the intersection of tokens between the two RDDs

//Note: collect() returns an Array[String] on the driver, not an RDD

val filteredTokens = biggerListofECtokensRDD.intersection(listofECtokensRDD).collect()
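
One thing to keep in mind: intersection() shuffles both RDDs and returns each common token only once, even if it appears several times in the bigger RDD. If you need to keep duplicates, a rough alternative is to broadcast the small array as a Set and filter against it. This is only a sketch, and it assumes the smaller list fits in memory on every executor. The names smallTokenSet, bigTokensRDD, keptTokensRDD and distinctCommonRDD are made up for illustration, and sc is the SparkContext that a Databricks notebook provides:

```scala
// Sketch only: assumes the smaller token list fits comfortably in executor memory.
val smallTokenSet = sc.broadcast(Set("EC-17A5206955089011B", "EC-17A5206955089011A"))

// A bigger RDD in which one token appears twice
val bigTokensRDD = sc.parallelize(Seq(
  "EC-17A5206955089011B", "EC-17A5206955089011B",
  "EC-17A5206955089011A", "EC-15B5206955089011A"))

// filter() keeps every matching element, duplicates included, and needs no shuffle
val keptTokensRDD = bigTokensRDD.filter(token => smallTokenSet.value.contains(token))

// intersection() returns each common token once
val distinctCommonRDD = bigTokensRDD.intersection(sc.parallelize(smallTokenSet.value.toSeq))
```

Which one fits depends on whether you want the distinct set of matching tokens or every matching occurrence.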

Please note that collect() sends all of the filtered data back to the driver machine. That is acceptable for a small example like this, but with big data the driver may run out of memory.
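
If the filtered result could be large, here is a minimal sketch of a few ways to avoid pulling everything onto the driver. The variable names and the output path are placeholders I made up, and sc is again the notebook's SparkContext:

```scala
val smallRDD = sc.parallelize(Seq("EC-17A5206955089011B", "EC-17A5206955089011A"))
val bigRDD = sc.parallelize(Seq(
  "EC-17A5206955089011B", "EC-17A5206955089011A", "EC-15B5206955089011A"))

val commonRDD = bigRDD.intersection(smallRDD)

// Count the matches on the cluster; only a single number comes back to the driver
val matchCount = commonRDD.count()

// Look at a bounded sample instead of the whole result
val sample = commonRDD.take(10)

// Write the full result to storage instead of collecting it.
// Hypothetical path; saveAsTextFile fails if the directory already exists.
commonRDD.saveAsTextFile("/tmp/filtered-ec-tokens")
```

take(n) still returns data to the driver, but never more than n elements, so it is a safer way to eyeball the output.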