<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic parse and combine multiple datasets within a single file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8079#M3799</link>
    <description>&lt;P&gt;An application receives messages from Event Hub. Below is one such message, loaded into a DataFrame with a single column:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Each message may contain up to three datasets, separated by a line of five dashes (-----).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset1:&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id is the header row of dataset1.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset2:&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat is the header row of dataset2.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset3:&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;event,a,b,c is the header row of dataset3.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The application is a Spark Streaming job that batches multiple such messages into one DataFrame.
For example, a single-column DataFrame holding three messages from Event Hub could look like this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;janet,f,002&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;839,22,08,81,91,23&lt;/P&gt;&lt;P&gt;110,42,68,31,74,45&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;ross,m,003&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;209,33,10,11,61,47&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;246,91,82,92,28,98&lt;/P&gt;&lt;P&gt;230,03,62,56,02,42&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,4,1,0&lt;/P&gt;&lt;P&gt;092,9,8,3&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The dataset with header event,a,b,c may or may not be present in a message, as the message for 'janet' above shows.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The objective is to combine dataset1 and dataset2 within each message; dataset3 is excluded. The result should look like:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name gender id time x y z long lat&lt;/P&gt;&lt;P&gt;sam m 001 160 22 45 51 83 56&lt;/P&gt;&lt;P&gt;sam m 001 230 82 95 48 18 26&lt;/P&gt;&lt;P&gt;janet f 002 839 22 08 81 91 23&lt;/P&gt;&lt;P&gt;janet f 002 110 42 68 31 74 45&lt;/P&gt;&lt;P&gt;ross m 003 209 33 10 11 61 47&lt;/P&gt;&lt;P&gt;ross m 003 230 82 95 48 18 26&lt;/P&gt;&lt;P&gt;ross m 003 246 91 82 92 28 98&lt;/P&gt;&lt;P&gt;ross m 003 230 03 62 56 02 42&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How can this be achieved using Scala?&lt;/P&gt;</description>
    <pubDate>Wed, 08 Mar 2023 18:28:33 GMT</pubDate>
    <dc:creator>Sandesh87</dc:creator>
    <dc:date>2023-03-08T18:28:33Z</dc:date>
    <item>
      <title>parse and combine multiple datasets within a single file</title>
      <link>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8079#M3799</link>
      <description>&lt;P&gt;An application receives messages from Event Hub. Below is one such message, loaded into a DataFrame with a single column:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Each message may contain up to three datasets, separated by a line of five dashes (-----).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset1:&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id is the header row of dataset1.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset2:&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat is the header row of dataset2.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dataset3:&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;event,a,b,c is the header row of dataset3.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The application is a Spark Streaming job that batches multiple such messages into one DataFrame.
For example, a single-column DataFrame holding three messages from Event Hub could look like this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;sam,m,001&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;160,22,45,51,83,56&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,1,5,6&lt;/P&gt;&lt;P&gt;073,4,2,8&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;janet,f,002&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;839,22,08,81,91,23&lt;/P&gt;&lt;P&gt;110,42,68,31,74,45&lt;/P&gt;&lt;P&gt;name,gender,id&lt;/P&gt;&lt;P&gt;ross,m,003&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;time,x,y,z,long,lat&lt;/P&gt;&lt;P&gt;209,33,10,11,61,47&lt;/P&gt;&lt;P&gt;230,82,95,48,18,26&lt;/P&gt;&lt;P&gt;246,91,82,92,28,98&lt;/P&gt;&lt;P&gt;230,03,62,56,02,42&lt;/P&gt;&lt;P&gt;-----&lt;/P&gt;&lt;P&gt;event,a,b,c&lt;/P&gt;&lt;P&gt;034,4,1,0&lt;/P&gt;&lt;P&gt;092,9,8,3&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The dataset with header event,a,b,c may or may not be present in a message, as the message for 'janet' above shows.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The objective is to combine dataset1 and dataset2 within each message; dataset3 is excluded. The result should look like:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;name gender id time x y z long lat&lt;/P&gt;&lt;P&gt;sam m 001 160 22 45 51 83 56&lt;/P&gt;&lt;P&gt;sam m 001 230 82 95 48 18 26&lt;/P&gt;&lt;P&gt;janet f 002 839 22 08 81 91 23&lt;/P&gt;&lt;P&gt;janet f 002 110 42 68 31 74 45&lt;/P&gt;&lt;P&gt;ross m 003 209 33 10 11 61 47&lt;/P&gt;&lt;P&gt;ross m 003 230 82 95 48 18 26&lt;/P&gt;&lt;P&gt;ross m 003 246 91 82 92 28 98&lt;/P&gt;&lt;P&gt;ross m 003 230 03 62 56 02 42&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How can this be achieved using Scala?&lt;/P&gt;</description>
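      <!--
      The combination described in the question can be sketched in plain Scala as a single
      pass over the lines. This is a minimal sketch, not the thread's accepted answer: the
      names CombineDatasets, State and combine are hypothetical, the three header strings
      are copied from the sample message, and the code assumes the lines of each message
      are still in their original order (something a DataFrame does not guarantee by
      itself). Wiring this into the streaming job is left out.

      ```scala
      object CombineDatasets {
        // Header lines copied from the question; each one marks the start of a dataset.
        val NameHeader = "name,gender,id"
        val TimeHeader = "time,x,y,z,long,lat"
        val EventHeader = "event,a,b,c"
        // The five-dash line that separates datasets inside a message.
        val Separator = "-" * 5

        // Parser state: the header of the dataset being read, the current dataset1 row,
        // and the combined rows emitted so far.
        final case class State(section: String, person: Seq[String], rows: Vector[Seq[String]])

        def combine(lines: Seq[String]): Vector[Seq[String]] = {
          val cleaned = lines.map(_.trim).filter(_.nonEmpty)
          val finalState = cleaned.foldLeft(State("", Seq.empty, Vector.empty)) { (st, line) =>
            line match {
              case Separator => st // separators carry no data
              case NameHeader | TimeHeader | EventHeader => st.copy(section = line)
              case data =>
                val fields = data.split(",").toSeq
                st.section match {
                  case NameHeader => st.copy(person = fields) // remember the dataset1 row
                  case TimeHeader => st.copy(rows = st.rows :+ (st.person ++ fields))
                  case _ => st // dataset3 rows (and anything unexpected) are dropped
                }
            }
          }
          finalState.rows
        }

        def main(args: Array[String]): Unit = {
          val batch = Seq(
            "name,gender,id", "janet,f,002", Separator,
            "time,x,y,z,long,lat", "839,22,08,81,91,23", "110,42,68,31,74,45",
            "name,gender,id", "ross,m,003", Separator,
            "time,x,y,z,long,lat", "209,33,10,11,61,47", Separator,
            "event,a,b,c", "034,4,1,0")
          combine(batch).foreach(row => println(row.mkString(" ")))
        }
      }
      ```

      Tracking the current header rather than counting separators is a deliberate choice
      here: it copes with messages such as the one for 'janet', where dataset3 is absent
      and the next message starts right after the dataset2 rows.
      -->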
      <pubDate>Wed, 08 Mar 2023 18:28:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8079#M3799</guid>
      <dc:creator>Sandesh87</dc:creator>
      <dc:date>2023-03-08T18:28:33Z</dc:date>
    </item>
    <item>
      <title>Re: parse and combine multiple datasets within a single file</title>
      <link>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8080#M3800</link>
      <description>&lt;P&gt;I would say don't use Spark for data with poor or missing schemas. Use Spark for scale and for data that has a schema. Consider fixing whatever is producing these messages.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Mar 2023 18:38:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8080#M3800</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-03-08T18:38:32Z</dc:date>
    </item>
    <item>
      <title>Re: parse and combine multiple datasets within a single file</title>
      <link>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8081#M3801</link>
      <description>&lt;P&gt;I appreciate the feedback, but I can't control what comes through Event Hub. The message format is what it is and can't be changed.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Mar 2023 18:58:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8081#M3801</guid>
      <dc:creator>Sandesh87</dc:creator>
      <dc:date>2023-03-08T18:58:07Z</dc:date>
    </item>
    <item>
      <title>Re: parse and combine multiple datasets within a single file</title>
      <link>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8082#M3802</link>
      <description>&lt;P&gt;Hi @Sandesh Puligundla&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope all is well!&lt;/P&gt;&lt;P&gt;Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 03 Apr 2023 10:38:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parse-and-combine-multiple-datasets-within-a-single-file/m-p/8082#M3802</guid>
      <dc:creator>Vartika</dc:creator>
      <dc:date>2023-04-03T10:38:14Z</dc:date>
    </item>
  </channel>
</rss>

