Databricks Community

Dimitry · yesterday

I have a query with some grouping. I'm using spark.sql to run that query.

skus = spark.sql('with cte as (select... group by all) select *, .. from cte group by all')

The table displays as expected.

I want to split this table into batches for processing, with `rows_per_batch` rows in each batch.

Once the table is grouped, the `batch_id` column shows random garbage:

If I display `batch_id` on its own, it shows the expected values 0 to 3, with no big numbers like "1041204193" above.

If I do a select distinct, I get garbage again:

The only workaround I have found so far, which I hope is temporary, is to convert the original dataset to pandas and back.

skus_pdf = skus.toPandas() 
skus = spark.createDataFrame(skus_pdf)

Once I include this, everything works, with no junk numbers.

So why does a Spark DataFrame created from a query fail to aggregate correctly?

I tried on both serverless and dedicated compute, with the same outcome.

Could someone please advise?

Coffee77 · 6 hours ago

If you want to get unique sequential "IDs" for aggregations/batches, use SQL window functions. Here is a basic sample:
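A minimal sketch of the window-function approach, not necessarily the sample posted in this reply. It assumes the DataFrame can be registered as a temp view and has a column with unique values to order by. The names `build_batch_query`, `expected_batch_id`, `with_batch_id`, and the default view name `skus_for_batching` are hypothetical, introduced only for illustration. `ROW_NUMBER()` numbers the rows 1..N in a fixed order, and dividing by `rows_per_batch` turns that into a zero-based `batch_id`.

```python
def build_batch_query(view_name: str, order_col: str, rows_per_batch: int) -> str:
    """Build a Spark SQL query that adds a zero-based batch_id column.

    ROW_NUMBER() assigns 1..N in the order given by order_col, so the
    result is repeatable as long as order_col has no ties.
    """
    if rows_per_batch <= 0:
        raise ValueError("rows_per_batch must be a positive integer")
    return (
        "SELECT *, "
        f"CAST(FLOOR((ROW_NUMBER() OVER (ORDER BY {order_col}) - 1) / {rows_per_batch}) AS INT) AS batch_id "
        f"FROM {view_name}"
    )


def expected_batch_id(row_number: int, rows_per_batch: int) -> int:
    """Plain-Python mirror of the SQL arithmetic, handy for sanity checks."""
    return (row_number - 1) // rows_per_batch


def with_batch_id(spark, df, order_col, rows_per_batch, view_name="skus_for_batching"):
    """Register df as a temp view (hypothetical default name) and run the query."""
    df.createOrReplaceTempView(view_name)
    return spark.sql(build_batch_query(view_name, order_col, rows_per_batch))
```

Usage would look like `batched = with_batch_id(spark, skus, "sku", rows_per_batch)`, where `sku` is an assumed column name. Note that a window with no `PARTITION BY` pulls all rows into a single partition, which is fine for small tables but can be slow on large ones.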

Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Dataframe from SQL query glitches when grouping - what is going on !?!