Dataframe from SQL query glitches when grouping - what is going on !?!

Dimitry
Valued Contributor

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

It displays as expected table.

This table I want to split into batches for processing, `rows_per_batch` in each batch

It displays some random garbage in `batch_id` column once grouped:

Dimitry_1-1763964698802.png

If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above

Dimitry_2-1763964772801.png

If I do select distinct, I will get garbage again:

Dimitry_3-1763964861816.png

The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything starts working, no junk numbers.

Dimitry_4-1763964998951.png

So why spark dataframe from query fails to aggregate correctly? 

I tried on both serverless and dedicated, same outcome.

Please someone advise