<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Dataframe Count before and after write command do not match in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/100213#M40227</link>
    <description>&lt;P&gt;I just found out that I was populating a column with random values, and those values are filtered in a join, so every write and every count re-evaluates them and the numbers change &lt;span class="lia-unicode-emoji" title=":grinning_face_with_sweat:"&gt;😅&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 27 Nov 2024 10:44:14 GMT</pubDate>
    <dc:creator>Riccardo96</dc:creator>
    <dc:date>2024-11-27T10:44:14Z</dc:date>
    <item>
      <title>Dataframe Count before and after write command do not match</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/99993#M40161</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have noticed some strange behaviour in a notebook I am developing. When the notebook reads a single file it works correctly, but when I set it to read multiple files at once, using the recursive lookup option, the count I take before writing to the final table does not match the count I take after the write (picture attached).&lt;/P&gt;&lt;P&gt;Thanks in advance to everyone able to help me!&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2024 17:49:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/99993#M40161</guid>
      <dc:creator>Riccardo96</dc:creator>
      <dc:date>2024-11-25T17:49:25Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe Count before and after write command do not match</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/99999#M40162</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133740"&gt;@Riccardo96&lt;/a&gt;,&lt;/P&gt;
&lt;P class="p1"&gt;This behavior suggests that rows might be getting dropped or overwritten during the writing process, particularly when using the &lt;SPAN class="s1"&gt;replaceWhere&lt;/SPAN&gt; option with clustering or partitioning.&lt;/P&gt;
&lt;P class="p1"&gt;The &lt;SPAN class="s1"&gt;replaceWhere&lt;/SPAN&gt; option replaces data based on the specified condition (&lt;SPAN class="s1"&gt;year&lt;/SPAN&gt;, &lt;SPAN class="s1"&gt;month&lt;/SPAN&gt;, and &lt;SPAN class="s1"&gt;day&lt;/SPAN&gt;). If multiple files have overlapping data for the same day, some rows might get overwritten.&lt;/P&gt;
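&lt;P class="p1"&gt;A minimal pure-Python sketch of that overwrite effect, for illustration only: it models a table as a list of rows and is not the Delta Lake API. The helper names &lt;SPAN class="s1"&gt;replace_where&lt;/SPAN&gt; and &lt;SPAN class="s1"&gt;make_rows&lt;/SPAN&gt;, the file names, and the row counts are all hypothetical, and it assumes each input file is written in its own write with the same day predicate.&lt;/P&gt;

```python
# Pure-Python model of a replaceWhere-style overwrite.
# Illustration only: this is NOT the Delta Lake API, and all names are hypothetical.


def replace_where(table, new_rows, key):
    """Drop every row whose (year, month, day) equals `key`, then append `new_rows`."""
    kept = [row for row in table if (row["year"], row["month"], row["day"]) != key]
    return kept + list(new_rows)


def make_rows(source, n, key):
    """Build `n` rows for one day, tagged with the (hypothetical) file they came from."""
    year, month, day = key
    return [{"year": year, "month": month, "day": day, "source": source} for _ in range(n)]


day = (2024, 11, 25)
file_a = make_rows("file_a", 3, day)  # hypothetical file with 3 rows for the day
file_b = make_rows("file_b", 2, day)  # second hypothetical file covering the same day

# Each file written separately with the same predicate: the second write replaces the first.
table = []
table = replace_where(table, file_a, day)
table = replace_where(table, file_b, day)
rows_after_separate_writes = len(table)

# Both files written as one batch: every row for the day is kept.
table_single = replace_where([], file_a + file_b, day)
rows_after_single_write = len(table_single)

print(rows_after_separate_writes, rows_after_single_write)
```

&lt;P class="p1"&gt;If the per-day counts in the table look like those of the last file alone rather than the sum over all files, this is the pattern to suspect.&lt;/P&gt;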
&lt;P class="p1"&gt;You can debug this by running the following before and after the write:&lt;/P&gt;
&lt;P class="p1"&gt;df_adobe_nav_utente.groupBy("year", "month", "day").count().show()&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2024 18:18:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/99999#M40162</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2024-11-25T18:18:58Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe Count before and after write command do not match</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/100200#M40220</link>
      <description>&lt;P&gt;I'm working with Databricks Runtime 15.4 LTS.&lt;/P&gt;&lt;P&gt;These are the steps I ran, in order:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;count(*) on the DataFrame:&amp;nbsp;&lt;SPAN&gt;99228246 rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Group by on the DataFrame, grouping by year, month, day:&amp;nbsp;99486114 rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Group by on the output table, grouping by year, month, day:&amp;nbsp;0 rows (empty)&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Another count(*) on the same DataFrame:&amp;nbsp;100167165 rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Group by on the output table, grouping by year, month, day:&amp;nbsp;100031507 rows&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Wed, 27 Nov 2024 09:11:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/100200#M40220</guid>
      <dc:creator>Riccardo96</dc:creator>
      <dc:date>2024-11-27T09:11:14Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe Count before and after write command do not match</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/100213#M40227</link>
      <description>&lt;P&gt;I just found out that I was populating a column with random values, and those values are filtered in a join, so every write and every count re-evaluates them and the numbers change &lt;span class="lia-unicode-emoji" title=":grinning_face_with_sweat:"&gt;😅&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Nov 2024 10:44:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-count-before-and-after-write-command-do-not-match/m-p/100213#M40227</guid>
      <dc:creator>Riccardo96</dc:creator>
      <dc:date>2024-11-27T10:44:14Z</dc:date>
    </item>
  </channel>
</rss>

