Dataframe from SQL query glitches when grouping - ...

Dimitry · ‎11-23-2025

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

It displays as expected table.

This table I want to split into batches for processing, `rows_per_batch` in each batch

It displays some random garbage in `batch_id` column once grouped:

If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above

If I do select distinct, I will get garbage again:

The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything starts working, no junk numbers.

So why spark dataframe from query fails to aggregate correctly?

I tried on both serverless and dedicated, same outcome.

Please someone advise

Coffee77 · ‎11-24-2025

If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:

Lifelong Solution Architect Learner | Coffee & Data

Dimitry · ‎11-24-2025

Hi @Coffee77

Its not a problem with sql (I can generate batch id in other means) - it is a sudden problem with SPARK that may happen with my other existing queries out of nowhere. I don't understand how an existing field (that batch id) being visibly correct for my test set of records (which entirely fit on screen) becomes a random number in distinct or group-by operators. This frightens me that sql query suddenly becomes unreliable.

I can only recall vacuuming some of the tables prior to running the query... but how can it affect this so badly?

Coffee77 · ‎11-25-2025

Try to avoid monotonically_increasing_id function (The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.). Otherwise, you are likely to get that weird behavior. I started using that function in my projects but I had to discard duue to similar issues. So, I use "SQL Windows Functions". That's why I pasted the above code. Is your issue arising without using that function?

Lifelong Solution Architect Learner | Coffee & Data

Coffee77 · ‎11-25-2025

Try to use this code customized in the way you need:

Instead of using monotonically_increasing_id function directly, use row_number over the previous result. This will ensure sequential "small" numbers. This was indeed the exact solution I used to solve some similar issues in which I kept monotonically_increasing_id function

Lifelong Solution Architect Learner | Coffee & Data

Dataframe from SQL query glitches when grouping - what is going on !?!