<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Records are missing while creating new data from one big dataframe using filter in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43370#M5741</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;There is no issue with the data. As you can see at line number 20 in my code, &lt;STRONG&gt;all_trans_df&lt;/STRONG&gt; is created after reading the data from the file and is then passed to this function.&lt;/P&gt;&lt;P&gt;We can see the data in the&amp;nbsp;&lt;STRONG&gt;all_trans_df&lt;/STRONG&gt; dataframe but not in the result dataframe.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Note: I have nearly 30 files and process them in parallel using multithreading.&lt;/EM&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 04 Sep 2023 11:02:08 GMT</pubDate>
    <dc:creator>Policepatil</dc:creator>
    <dc:date>2023-09-04T11:02:08Z</dc:date>
    <item>
      <title>Records are missing while creating new data from one big dataframe using filter</title>
      <link>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43327#M5739</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have data in a file like the one below:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Policepatil_1-1693806659492.png"&gt;&lt;img src="https://community.databricks.com/skins/images/057EB9D5BC389948ACA97597C8650404/responsive_peak/images/image_unmoderated.gif" alt="Policepatil_1-1693806659492.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Policepatil_3-1693806860560.png"&gt;&lt;img src="https://community.databricks.com/skins/images/057EB9D5BC389948ACA97597C8650404/responsive_peak/images/image_unmoderated.gif" alt="Policepatil_3-1693806860560.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The input file contains different types of rows; column number 8 defines the record type.&lt;/P&gt;&lt;P&gt;The file above has 4 record types, 00 to 03.&lt;/P&gt;&lt;P&gt;My requirement is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;There will be multiple files in the source path, each with nearly 1 million records.&lt;/LI&gt;&lt;LI&gt;Read the files and create a separate dataframe for each record type by filtering the original dataframe (which contains all record types).&lt;/LI&gt;&lt;LI&gt;Based on the mapping file, select the column positions and map them to column names.&lt;/LI&gt;&lt;LI&gt;Create a dictionary of dataframes with the record type as the key and the dataframe as the value.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;My code looks like this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Policepatil_4-1693807544901.png"&gt;&lt;img src="https://community.databricks.com/skins/images/057EB9D5BC389948ACA97597C8650404/responsive_peak/images/image_unmoderated.gif" alt="Policepatil_4-1693807544901.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The issue is that some records are missing from the result dataframes.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;For id 1836, record type 01, there should be 15 records, but we get only 14. If we rerun the job, the same issue appears in another file for another id.&lt;/LI&gt;&lt;LI&gt;In the original dataframe there are 18 rows in total for id 1836, and 15 of those 18 belong to record type 01.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Policepatil_5-1693807898507.png"&gt;&lt;img src="https://community.databricks.com/skins/images/057EB9D5BC389948ACA97597C8650404/responsive_peak/images/image_unmoderated.gif" alt="Policepatil_5-1693807898507.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The dataframe below is the result after filtering on record type; one record is missing. There should be 15 rows, but we have only 14.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Policepatil_6-1693808066454.png"&gt;&lt;img src="https://community.databricks.com/skins/images/057EB9D5BC389948ACA97597C8650404/responsive_peak/images/image_unmoderated.gif" alt="Policepatil_6-1693808066454.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Why are records missing after filtering?&lt;/P&gt;</description>
      <pubDate>Mon, 04 Sep 2023 06:25:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43327#M5739</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2023-09-04T06:25:16Z</dc:date>
    </item>
    <item>
      <title>Re: Records are missing while creating new data from one big dataframe using filter</title>
      <link>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43370#M5741</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;There is no issue with the data. As you can see at line number 20 in my code, &lt;STRONG&gt;all_trans_df&lt;/STRONG&gt; is created after reading the data from the file and is then passed to this function.&lt;/P&gt;&lt;P&gt;We can see the data in the&amp;nbsp;&lt;STRONG&gt;all_trans_df&lt;/STRONG&gt; dataframe but not in the result dataframe.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Note: I have nearly 30 files and process them in parallel using multithreading.&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Sep 2023 11:02:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43370#M5741</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2023-09-04T11:02:08Z</dc:date>
    </item>
    <item>
      <title>Re: Records are missing while creating new data from one big dataframe using filter</title>
      <link>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43376#M5742</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;If I run again with the same files, records are sometimes missing from the same files as in the previous run and sometimes from different files.&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;Run 1: 1 record missing in file1; no issue with the other files.&lt;/P&gt;&lt;P&gt;Run 2: 1 record missing in file3 and file4; no issue with the other files.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Sep 2023 11:07:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/records-are-missing-while-creating-new-data-from-one-big/m-p/43376#M5742</guid>
      <dc:creator>Policepatil</dc:creator>
      <dc:date>2023-09-04T11:07:10Z</dc:date>
    </item>
  </channel>
</rss>

