Dataframe from SQL query glitches when grouping - what is going on !?!

Dimitry
Valued Contributor

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

It displays as expected table.

This table I want to split into batches for processing, `rows_per_batch` in each batch

It displays some random garbage in `batch_id` column once grouped:

Dimitry_1-1763964698802.png

If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above

Dimitry_2-1763964772801.png

If I do select distinct, I will get garbage again:

Dimitry_3-1763964861816.png

The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything starts working, no junk numbers.

Dimitry_4-1763964998951.png

So why spark dataframe from query fails to aggregate correctly? 

I tried on both serverless and dedicated, same outcome.

Please someone advise

 

 

 

 

 

 

 

 

Coffee77
Honored Contributor II

If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:

Coffee77_0-1763977102208.png

 

 

 


Lifelong Solution Architect Learner | Coffee & Data

Dimitry
Valued Contributor

Hi @Coffee77 

Its not a problem with sql (I can generate batch id in other means) - it is a sudden problem with SPARK that may happen with my other existing queries out of nowhere. I don't understand how an existing field (that batch id) being visibly correct for my test set of records (which entirely fit on screen) becomes a random number in distinct or group-by operators. This frightens me that sql query suddenly becomes unreliable.

I can only recall vacuuming some of the tables prior to running the query... but how can it affect this so badly? 

 

Coffee77
Honored Contributor II

Try to avoid monotonically_increasing_id function (The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.). Otherwise, you are likely to get that weird behavior. I started using that function in my projects but I had to discard duue to similar issues. So, I use "SQL Windows Functions". That's why I pasted the above code. Is your issue arising without using that function?


Lifelong Solution Architect Learner | Coffee & Data

Coffee77
Honored Contributor II

Try to use this code customized in the way you need:

Coffee77_0-1764061670888.png

Instead of using monotonically_increasing_id function directly, use row_number over the previous result. This will ensure sequential "small" numbers. This was indeed the exact solution I used to solve some similar issues in which I kept monotonically_increasing_id function 


Lifelong Solution Architect Learner | Coffee & Data