11-23-2025 10:25 PM
I have a query with some grouping. I'm using spark.sql to run that query.
skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')
It displays the expected table.
I want to split this table into batches for processing, with `rows_per_batch` rows in each batch.
Once grouped, the `batch_id` column displays random garbage:
If I select `batch_id` on its own, it displays the expected values 0 to 3, no big numbers like "1041204193" above.
If I do a select distinct, I get garbage again:
The only solution I have found so far, hopefully a temporary one, is to cast the original dataset into Pandas and back:
skus_pdf = skus.toPandas()
skus = spark.createDataFrame(skus_pdf)
Once I include this, everything starts working, with no junk numbers.
So why does the Spark dataframe from the query fail to aggregate correctly?
I tried on both serverless and dedicated compute, with the same outcome.
Can someone please advise?
11-24-2025 01:38 AM
If you want unique sequential "IDs" for aggregations/batches, use SQL window functions. Here is a basic sample:
11-24-2025 06:27 PM
Hi @Coffee77
It's not a problem with the SQL (I can generate the batch id by other means); it is a sudden problem with SPARK that may happen with my other existing queries out of nowhere. I don't understand how an existing field (that batch id), visibly correct for my test set of records (which fits entirely on screen), becomes a random number under distinct or group-by operators. It frightens me that a SQL query can suddenly become unreliable.
I can only recall vacuuming some of the tables prior to running the query... but how could that affect things so badly?
11-25-2025 12:15 AM
Try to avoid the monotonically_increasing_id function (the generated ID is guaranteed to be monotonically increasing and unique, but not consecutive). Otherwise, you are likely to get that weird behavior. I started using that function in my projects but had to discard it due to similar issues. So I use SQL window functions instead; that's why I pasted the above code. Does your issue arise without using that function?
11-25-2025 01:12 AM
Try this code, customized the way you need:
Instead of using the monotonically_increasing_id function directly, apply row_number over the previous result. This ensures sequential "small" numbers. This was indeed the exact solution I used for some similar issues in which I had kept monotonically_increasing_id.