Parse and combine multiple datasets within a single file
03-08-2023 10:28 AM
An application receives messages from Event Hub. Below is a message received from Event Hub and loaded into a DataFrame with one column:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2,8
Each message may contain up to three datasets, separated by a line of five dashes (-----):
dataset1:
name,gender,id
sam,m,001
name,gender,id is the header row of dataset1
dataset2:
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
time,x,y,z,long,lat is the header row of dataset2
dataset3:
event,a,b,c
034,1,5,6
073,4,2,8
event,a,b,c is the header row of dataset3
The application is a Spark Streaming application and batches multiple such messages into one DataFrame. For example, a DataFrame with one column holding three messages from Event Hub could look like the following:
name,gender,id
sam,m,001
-----
time,x,y,z,long,lat
160,22,45,51,83,56
230,82,95,48,18,26
-----
event,a,b,c
034,1,5,6
073,4,2,8
name,gender,id
janet,f,002
-----
time,x,y,z,long,lat
839,22,08,81,91,23
110,42,68,31,74,45
name,gender,id
ross,m,003
-----
time,x,y,z,long,lat
209,33,10,11,61,47
230,82,95,48,18,26
246,91,82,92,28,98
230,03,62,56,02,42
-----
event,a,b,c
034,4,1,0
092,9,8,3
The dataset with header event,a,b,c may or may not be present in a message, as can be seen in the message for 'janet' above.
The objective is to combine datasets 1 and 2 belonging to the same message; dataset3 is excluded. The result should look like:
name   gender  id   time  x   y   z   long  lat
sam    m       001  160   22  45  51  83    56
sam    m       001  230   82  95  48  18    26
janet  f       002  839   22  08  81  91    23
janet  f       002  110   42  68  31  74    45
ross   m       003  209   33  10  11  61    47
ross   m       003  230   82  95  48  18    26
ross   m       003  246   91  82  92  28    98
ross   m       003  230   03  62  56  02    42
How can this be achieved using Scala?
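
One possible approach (a sketch, not a tested solution): assuming each Event Hub message arrives as one string value per DataFrame row, split the message on the ----- separator line, keep the block whose header is name,gender,id and the block whose header is time,x,y,z,long,lat, ignore any event,a,b,c block, and repeat the single dataset1 row for every dataset2 row. The case class, object, and column names below (Combined, MessageParser, body) are illustrative assumptions, not from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Target row shape: dataset1 fields repeated for every dataset2 row.
case class Combined(name: String, gender: String, id: String,
                    time: String, x: String, y: String, z: String,
                    long: String, lat: String)

object MessageParser {

  // Parse one raw message. Blocks are separated by a line of five dashes;
  // dataset1 starts with "name,gender,id", dataset2 with "time,x,y,z,long,lat",
  // and an optional "event,a,b,c" block is ignored.
  def parseMessage(msg: String): Seq[Combined] = {
    val blocks = msg.split("(?m)^-----\\s*$").map(_.trim).filter(_.nonEmpty)

    // Return the data rows (header dropped, comma-split) of the block
    // that starts with the given header, or an empty Seq if it is absent.
    def dataRows(header: String): Seq[Array[String]] =
      blocks.find(_.startsWith(header)).toSeq
        .flatMap(_.split("\\R").drop(1))
        .map(_.trim.split(","))

    for {
      Array(name, gender, id)         <- dataRows("name,gender,id")
      Array(time, x, y, z, long, lat) <- dataRows("time,x,y,z,long,lat")
    } yield Combined(name, gender, id, time, x, y, z, long, lat)
  }

  // Apply the parser to a batch or streaming DataFrame whose "body" column
  // (an assumed name) holds the raw message text.
  def combine(raw: DataFrame)(implicit spark: SparkSession): DataFrame = {
    import spark.implicits._
    raw.select($"body".cast("string")).as[String]
       .flatMap(parseMessage _)
       .toDF()
  }
}
```

Since flatMap on a Dataset is supported in Structured Streaming, the same parseMessage can be applied directly to the streaming DataFrame or inside foreachBatch. If several messages ever arrive concatenated in a single value, they would need to be split on the name,gender,id header first before applying parseMessage.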
- Labels:
  - Azure Databricks
  - Spark Streaming

03-08-2023 10:38 AM
I would say don't use Spark for data that has an awful schema or no schema at all. Use Spark for scale and for data with a schema. Maybe try to fix whatever is creating these messages.
03-08-2023 10:58 AM
I appreciate the feedback, but I can't control what is coming through Event Hub. The message format is just the way it is and can't be changed.
04-03-2023 03:38 AM
Hi @Sandesh Puligundla
Hope all is well!
Just wanted to check in and see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!

