DataFrame from SQL query glitches when grouping - what is going on?!
11-23-2025 10:25 PM
I have a query with some grouping. I'm using spark.sql to run that query.
skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')
It displays the table as expected.
I want to split this table into batches for processing, with `rows_per_batch` rows in each batch.
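For illustration only, here is a minimal sketch of one way such a split could be written. It is not necessarily the batching code behind the results described here. The helper names `batch_id_for` and `add_batch_id`, and the idea of ordering by a unique column, are assumptions made for this example.

```python
def batch_id_for(row_number: int, rows_per_batch: int) -> int:
    """Map a 1-based row number to a 0-based batch id."""
    return (row_number - 1) // rows_per_batch


def add_batch_id(df, order_col: str, rows_per_batch: int):
    """Attach a batch_id column to a Spark DataFrame (hypothetical helper).

    Assumes order_col holds unique values, so the ordering is unambiguous.
    """
    # Imported here so batch_id_for can be used without Spark installed.
    from pyspark.sql import Window, functions as F

    # A window with no partitionBy moves all rows into a single partition,
    # which is acceptable for small tables but slow for large ones.
    w = Window.orderBy(order_col)
    rn = F.row_number().over(w)

    # Same arithmetic as batch_id_for, expressed on Spark columns.
    return df.withColumn(
        "batch_id", F.floor((rn - 1) / rows_per_batch).cast("int")
    )
```

Numbering rows over an explicit ordering is meant to keep the ids stable if Spark re-evaluates the query plan for a later action. Whether that is relevant to the behaviour described here is an assumption.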
Once grouped, the `batch_id` column displays random garbage:
If I dump `batch_id` on its own, it displays the expected values 0 to 3, with no big numbers like "1041204193" above.
If I do a select distinct, I get garbage again:
The only solution I have found so far, and I hope a temporary one, is to convert the original dataset to Pandas and back.
skus_pdf = skus.toPandas()
skus = spark.createDataFrame(skus_pdf)
Once I include this, everything starts working, with no junk numbers.
So why does a Spark DataFrame created from a query fail to aggregate correctly?
I tried on both serverless and dedicated compute, with the same outcome.
Could someone please advise?