<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Dataframe from SQL query glitches when grouping - what is going on !?! in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140072#M51343</link>
    <description>&lt;P&gt;If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Coffee77_0-1763977102208.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21922i9F122398C89D8D5A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Coffee77_0-1763977102208.png" alt="Coffee77_0-1763977102208.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 24 Nov 2025 09:38:54 GMT</pubDate>
    <dc:creator>Coffee77</dc:creator>
    <dc:date>2025-11-24T09:38:54Z</dc:date>
    <item>
      <title>Dataframe from SQL query glitches when grouping - what is going on !?!</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140045#M51337</link>
      <description>&lt;P&gt;I have a query with some grouping. I'm using spark.sql to run that query.&lt;/P&gt;&lt;PRE&gt;skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')&lt;/PRE&gt;&lt;P&gt;It displays as expected table.&lt;/P&gt;&lt;P&gt;This table I want to split into batches for processing, `&lt;SPAN&gt;rows_per_batch` in each batch&lt;/SPAN&gt;&lt;/P&gt;&lt;DIV&gt;&lt;P&gt;It displays some random garbage in `batch_id` column once grouped:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_1-1763964698802.png" style="width: 691px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21914iF31AC306F95F32B6/image-dimensions/691x453?v=v2" width="691" height="453" role="button" title="Dimitry_1-1763964698802.png" alt="Dimitry_1-1763964698802.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_2-1763964772801.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21915i57240B95D4F49DCA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_2-1763964772801.png" alt="Dimitry_2-1763964772801.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;If I do select distinct, I will get garbage again:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_3-1763964861816.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21916i4883AB32D60883BD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Dimitry_3-1763964861816.png" alt="Dimitry_3-1763964861816.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.&lt;/P&gt;&lt;/DIV&gt;&lt;LI-CODE lang="markup"&gt;skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)&lt;/LI-CODE&gt;&lt;DIV&gt;&lt;P&gt;Once I include this, everything starts working, no junk numbers.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Dimitry_4-1763964998951.png" style="width: 595px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21917i6F5D21892B8CB052/image-dimensions/595x378?v=v2" width="595" height="378" role="button" title="Dimitry_4-1763964998951.png" alt="Dimitry_4-1763964998951.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;So why spark dataframe from query fails to aggregate correctly?&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried on both serverless and dedicated, same outcome.&lt;/P&gt;&lt;P&gt;Please someone advise&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Nov 2025 06:25:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140045#M51337</guid>
      <dc:creator>Dimitry</dc:creator>
      <dc:date>2025-11-24T06:25:05Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe from SQL query glitches when grouping - what is going on !?!</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140072#M51343</link>
      <description>&lt;P&gt;If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Coffee77_0-1763977102208.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21922i9F122398C89D8D5A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Coffee77_0-1763977102208.png" alt="Coffee77_0-1763977102208.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Nov 2025 09:38:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140072#M51343</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-24T09:38:54Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe from SQL query glitches when grouping - what is going on !?!</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140244#M51362</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/179536"&gt;@Coffee77&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Its not a problem with sql (I can generate batch id in other means) - it is a sudden problem with SPARK that may happen with my other existing queries out of nowhere. I don't understand how an existing field (that batch id) being visibly correct for my test set of records (which entirely fit on screen) becomes a random number in distinct or group-by operators. This frightens me that sql query suddenly becomes unreliable.&lt;/P&gt;&lt;P&gt;I can only recall vacuuming some of the tables prior to running the query... but how can it affect this so badly?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 02:27:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140244#M51362</guid>
      <dc:creator>Dimitry</dc:creator>
      <dc:date>2025-11-25T02:27:52Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe from SQL query glitches when grouping - what is going on !?!</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140267#M51368</link>
      <description>&lt;P&gt;Try to avoid&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;monotonically_increasing_id&lt;/STRONG&gt; function (The generated ID is guaranteed to be monotonically increasing and unique, &lt;STRONG&gt;but not consecutive&lt;/STRONG&gt;.). Otherwise, you are likely to get that weird behavior. I started using that function in my projects but I had to discard duue to similar issues. So, I use "SQL Windows Functions". That's why I pasted the above code. Is your issue arising without using that function?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 08:15:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140267#M51368</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-25T08:15:24Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe from SQL query glitches when grouping - what is going on !?!</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140279#M51369</link>
      <description>&lt;P&gt;Try to use this code customized in the way you need:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Coffee77_0-1764061670888.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21944i2C57053F7F03CE4C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Coffee77_0-1764061670888.png" alt="Coffee77_0-1764061670888.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Instead of using&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;monotonically_increasing_id&lt;/STRONG&gt; function directly, use &lt;STRONG&gt;row_number&lt;/STRONG&gt; over the previous result. This will ensure sequential "small" numbers. This was indeed the exact solution I used to solve some similar issues in which I kept&amp;nbsp;&lt;STRONG&gt;monotonically_increasing_id&lt;/STRONG&gt; function&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 09:12:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-from-sql-query-glitches-when-grouping-what-is-going-on/m-p/140279#M51369</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-25T09:12:21Z</dc:date>
    </item>
  </channel>
</rss>

