An application receives messages from an Event Hub. Below is a single message received from the Event Hub and loaded into a DataFrame with one column:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2,8
Each message may contain up to three datasets, separated by a line of five dashes (-----):
dataset1:
name,gender,id
sam,m,001
name,gender,id is the header of dataset1
dataset2:
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
time,x,y,z,long,lat is the header of dataset2
dataset3:
event,a,b,c
034,1,5,6
073,4,2,8
event,a,b,c is the header of dataset3
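For reference, splitting one such message into its datasets could be sketched as below. The variable name rawMessage is an assumption for illustration; in the application the message text is the value held in the single DataFrame column.

// Minimal sketch: split one raw message into its datasets on the "-----" separator.
// rawMessage is a hypothetical name for the message text; all fields stay as strings.
val rawMessage: String =
  """name,gender,id
    |sam,m,001
    |-----
    |time,x,y,z,long,lat
    |160,22,45,51,83,56
    |230,82,95,48,18,26
    |-----
    |event,a,b,c
    |034,1,5,6
    |073,4,2,8""".stripMargin

// Each element is one dataset: a header line followed by its data rows.
val datasets: Array[String] = rawMessage.split("-----").map(_.trim)
// datasets(0) -> dataset1, datasets(1) -> dataset2, datasets(2) -> dataset3 (when present)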
The application is a Spark Streaming application that batches multiple such messages into one DataFrame. For example, a DataFrame with one column that loads three messages from the Event Hub could look like the following:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2,8
name,gender,id
janet,f,002
-----
time,x,y,z,long,lat
839,22,08,81,91,23
110,42,68,31,74,45
name,gender,id
ross,m,003
-----
time,x,y,z,long,lat
209,33,10,11,61,47
230,82,95,48,18,26
246,91,82,92,28,98
230,03,62,56,02,42
-----
event,a,b,c
034,4,1,0
092,9,8,3
The dataset with header event,a,b,c may or may not be present in a message, as can be seen in the message for 'janet' above.
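For local testing, a batched DataFrame of this shape could be mocked up roughly as follows. The column name value and the one-message-per-row layout are assumptions; in the real application the rows come from the Event Hub source.

// Hypothetical test harness: one full message per row, in a single string column named "value".
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val msg1 = "name,gender,id\nsam,m,001\n-----\ntime,x,y,z,long,lat\n160,22,45,51,83,56\n230,82,95,48,18,26\n-----\nevent,a,b,c\n034,1,5,6\n073,4,2,8"
val msg2 = "name,gender,id\njanet,f,002\n-----\ntime,x,y,z,long,lat\n839,22,08,81,91,23\n110,42,68,31,74,45"

val messagesDf = Seq(msg1, msg2).toDF("value")   // batched DataFrame with one column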
The objective is to combine dataset1 and dataset2 within each message; dataset3 is excluded. The result should look like:
name   gender  id   time  x   y   z   long  lat
sam    m       001  160   22  45  51  83    56
sam    m       001  230   82  95  48  18    26
janet  f       002  839   22  08  81  91    23
janet  f       002  110   42  68  31  74    45
ross   m       003  209   33  10  11  61    47
ross   m       003  230   82  95  48  18    26
ross   m       003  246   91  82  92  28    98
ross   m       003  230   03  62  56  02    42
How can this be achieved using Scala?
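A rough sketch of the kind of approach being considered is below, not a working solution. It assumes each row of the batched DataFrame (messagesDf with a single string column value, as mocked up above) holds one complete message, and it keeps every field as a string.

// Rough sketch: parse each message, keep datasets 1 and 2, ignore dataset3,
// and emit one combined output row per dataset2 line.
// spark and messagesDf are as defined in the test-harness sketch above.
import spark.implicits._

case class Combined(name: String, gender: String, id: String,
                    time: String, x: String, y: String, z: String,
                    long: String, lat: String)

val combined = messagesDf
  .select("value").as[String]
  .flatMap { msg =>
    // Split the message into its datasets on the five-dash separator.
    val parts = msg.split("-----").map(_.trim).filter(_.nonEmpty)

    // dataset1: header line plus one data row (name,gender,id).
    val ds1Lines = parts(0).split("\n").map(_.trim)
    val Array(name, gender, id) = ds1Lines(1).split(",")

    // dataset2: header line plus N data rows (time,x,y,z,long,lat); drop the header.
    val ds2Rows = parts(1).split("\n").map(_.trim).drop(1)

    // One output row per dataset2 row; dataset3, when present, is simply not read.
    ds2Rows.map { row =>
      val Array(time, x, y, z, long, lat) = row.split(",")
      Combined(name, gender, id, time, x, y, z, long, lat)
    }.toSeq
  }
  .toDF()

combined.show(false)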