Data Engineering

parse and combine multiple datasets within a single file

Sandesh87
New Contributor III

An application receives messages from Event Hub. Below is one such message, loaded into a DataFrame with a single column:

name,gender,id

sam,m,001

-----

time,x,y,z,long,lat

160,22,45,51,83,56

230,82,95,48,18,26

-----

event,a,b,c

034,1,5,6

073,4,2,8

Each message may contain up to three datasets, separated by a line of five dashes (-----):

dataset1:

name,gender,id

sam,m,001

name,gender,id is header information in dataset1

dataset2:

time,x,y,z,long,lat

160,22,45,51,83,56

230,82,95,48,18,26

time,x,y,z,long,lat is header information in dataset2

dataset3:

event,a,b,c

034,1,5,6

073,4,2,8

event,a,b,c is the header information in dataset3
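The structure described above (sections separated by a `-----` line, each starting with a header row) could be parsed roughly as follows. This is a minimal sketch in plain Scala, not a confirmed solution from this thread: the names `Section`, `SectionSplitter` and `splitSections` are hypothetical, and it assumes one complete message arrives as a single newline-separated string.

```scala
// One parsed dataset: its header fields and its data rows (hypothetical type).
final case class Section(header: Seq[String], rows: Seq[Seq[String]])

object SectionSplitter {
  // Assumption: the delimiter is a line consisting of exactly five dashes.
  private val Delimiter = "-----"

  def splitSections(message: String): Seq[Section] = {
    // Trim each line and drop blank lines before grouping.
    val lines = message.split("\\r?\\n").toSeq.map(_.trim).filter(_.nonEmpty)

    // Start a new group each time a delimiter line is seen.
    val groups = lines.foldLeft(Vector(Vector.empty[String])) { (acc, line) =>
      if (line == Delimiter) acc :+ Vector.empty[String]
      else acc.init :+ (acc.last :+ line)
    }

    // The first line of each non-empty group is the header, the rest are rows.
    groups.filter(_.nonEmpty).map { g =>
      Section(g.head.split(",", -1).toSeq, g.tail.map(_.split(",", -1).toSeq))
    }
  }
}
```

Values are kept as strings so that leading zeros such as `001` and `034` survive.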

The application is a Spark Streaming job that batches multiple such messages into one DataFrame. For example, a single-column DataFrame holding three messages from Event Hub could look like this:

name,gender,id

sam,m,001

-----

time,x,y,z,long,lat

160,22,45,51,83,56

230,82,95,48,18,26

-----

event,a,b,c

034,1,5,6

073,4,2,8

name,gender,id

janet,f,002

-----

time,x,y,z,long,lat

839,22,08,81,91,23

110,42,68,31,74,45

name,gender,id

ross,m,003

-----

time,x,y,z,long,lat

209,33,10,11,61,47

230,82,95,48,18,26

246,91,82,92,28,98

230,03,62,56,02,42

-----

event,a,b,c

034,4,1,0

092,9,8,3

The dataset with the header event,a,b,c may or may not be present in a message, as the message for 'janet' above shows.
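If the batch really is a flat run of lines like the one above, with no separator between messages, one way to recover message boundaries is to start a new message at every `name,gender,id` header. The sketch below is plain Scala with hypothetical names (`MessageSplitter`, `splitMessages`). It assumes every message begins with that header and that the lines are supplied in their original order, which a distributed DataFrame does not guarantee on its own.

```scala
object MessageSplitter {
  // Assumption: every message begins with this header line.
  private val FirstHeader = "name,gender,id"

  // Groups an ordered sequence of lines into one Vector of lines per message.
  def splitMessages(lines: Seq[String]): Vector[Vector[String]] =
    lines.map(_.trim).filter(_.nonEmpty)
      .foldLeft(Vector.empty[Vector[String]]) { (acc, line) =>
        if (line == FirstHeader) acc :+ Vector(line)   // start a new message
        else if (acc.isEmpty) acc                      // skip lines before the first header
        else acc.init :+ (acc.last :+ line)            // extend the current message
      }
}
```

Because the boundary is the first header rather than the last dataset, a message without the event,a,b,c section is still grouped correctly.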

The objective is to combine datasets 1 and 2 for each message; dataset3 is excluded. The result should look like:

name gender id time x y z long lat

sam m 001 160 22 45 51 83 56

sam m 001 230 82 95 48 18 26

janet f 002 839 22 08 81 91 23

janet f 002 110 42 68 31 74 45

ross m 003 209 33 10 11 61 47

ross m 003 230 82 95 48 18 26

ross m 003 246 91 82 92 28 98

ross m 003 230 03 62 56 02 42

How can I achieve this using Scala?
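One possible approach is sketched below in plain Scala. It is illustrative only, not an answer confirmed in this thread, and the names `Combined`, `MessageCombiner` and `combine` are hypothetical. It assumes each input string holds exactly one complete message. Sections are looked up by their header line, so the event,a,b,c dataset is ignored whether or not it is present, and the single name row is paired with every time row.

```scala
// One output row: the dataset1 record joined to one dataset2 record.
// Fields stay as strings so values such as "001" and "08" keep their leading zeros.
final case class Combined(name: String, gender: String, id: String,
                          time: String, x: String, y: String, z: String,
                          long: String, lat: String)

object MessageCombiner {
  private val Delimiter  = "-----"
  private val NameHeader = "name,gender,id"
  private val TimeHeader = "time,x,y,z,long,lat"

  def combine(message: String): Seq[Combined] = {
    val lines = message.split("\\r?\\n").toSeq.map(_.trim).filter(_.nonEmpty)

    // Group lines into sections, starting a new one at each delimiter line.
    val sections = lines.foldLeft(Vector(Vector.empty[String])) { (acc, l) =>
      if (l == Delimiter) acc :+ Vector.empty[String]
      else acc.init :+ (acc.last :+ l)
    }.filter(_.nonEmpty)

    // Data rows of the section whose first line equals the given header, if any.
    def rowsOf(header: String): Seq[Array[String]] =
      sections.find(_.head == header).toSeq.flatMap(_.tail).map(_.split(",", -1))

    // Pair each name row with each time row; malformed rows are skipped.
    for {
      n <- rowsOf(NameHeader) if n.length == 3
      t <- rowsOf(TimeHeader) if t.length == 6
    } yield Combined(n(0), n(1), n(2), t(0), t(1), t(2), t(3), t(4), t(5))
  }
}
```

In Spark, a function like this could probably be applied with `flatMap` on a `Dataset[String]` (with `spark.implicits._` in scope for the encoders), provided each row carries one whole message body rather than one line.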

3 REPLIES

Anonymous
Not applicable

I would say don't use Spark for data that has awful or no schemas. Use Spark for scale and for data with a schema. Maybe try to fix whatever is creating these messages.

Sandesh87
New Contributor III

I appreciate the feedback, but I can't control what comes through Event Hub. The message format is what it is and can't be changed.

Vartika
Moderator

Hi @Sandesh Puligundla

Hope all is well!

Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
