cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Dataframe from SQL query glitches when grouping - what is going on !?!

Dimitry
Contributor III

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

It displays as expected table.

This table I want to split into batches for processing, `rows_per_batch` in each batch

It displays some random garbage in `batch_id` column once grouped:

Dimitry_1-1763964698802.png

If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above

Dimitry_2-1763964772801.png

If I do select distinct, I will get garbage again:

Dimitry_3-1763964861816.png

The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything starts working, no junk numbers.

Dimitry_4-1763964998951.png

So why spark dataframe from query fails to aggregate correctly? 

I tried on both serverless and dedicated, same outcome.

Please someone advise

 

 

 

 

 

 

 

 

4 REPLIES 4

Coffee77
Contributor III

If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:

Coffee77_0-1763977102208.png

 

 

 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Dimitry
Contributor III

Hi @Coffee77 

Its not a problem with sql (I can generate batch id in other means) - it is a sudden problem with SPARK that may happen with my other existing queries out of nowhere. I don't understand how an existing field (that batch id) being visibly correct for my test set of records (which entirely fit on screen) becomes a random number in distinct or group-by operators. This frightens me that sql query suddenly becomes unreliable.

I can only recall vacuuming some of the tables prior to running the query... but how can it affect this so badly? 

 

Coffee77
Contributor III

Try to avoid monotonically_increasing_id function (The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.). Otherwise, you are likely to get that weird behavior. I started using that function in my projects but I had to discard duue to similar issues. So, I use "SQL Windows Functions". That's why I pasted the above code. Is your issue arising without using that function?


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Coffee77
Contributor III

Try to use this code customized in the way you need:

Coffee77_0-1764061670888.png

Instead of using monotonically_increasing_id function directly, use row_number over the previous result. This will ensure sequential "small" numbers. This was indeed the exact solution I used to solve some similar issues in which I kept monotonically_increasing_id function 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData