cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Dataframe from SQL query glitches when grouping - what is going on !?!

Dimitry
Contributor III

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

It displays as expected table.

This table I want to split into batches for processing, `rows_per_batch` in each batch

It displays some random garbage in `batch_id` column once grouped:

Dimitry_1-1763964698802.png

If I dump `batch_id` on its own, it will display expected values 0 to 3. No big numbers like "1041204193" above

Dimitry_2-1763964772801.png

If I do select distinct, I will get garbage again:

Dimitry_3-1763964861816.png

The only solution, albeit I hope temporary, I found so far is to cast original dataset into Pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything starts working, no junk numbers.

Dimitry_4-1763964998951.png

So why spark dataframe from query fails to aggregate correctly? 

I tried on both serverless and dedicated, same outcome.

Please someone advise

 

 

 

 

 

 

 

 

1 REPLY 1

Coffee77
Contributor III

If you want to get unique sequential "IDs" for aggregations/batches, use SQL Windows functions. Here is a basiic sample:

Coffee77_0-1763977102208.png

 

 

 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData