Morning all, hope you can help as I've been stumped for weeks.
Question: have there been recent changes to the Databricks query engine, or Photon (etc) which may impact large sort operations?
I have a Jobs pipeline that runs a series of notebooks which compile data from a few different operating systems and merge it together. Each notebook has around 100 cells in it, pure SQL right now with no Python. This pipeline has been running like clockwork, once weekly with an approx 1hr 45min run time, for about a year with zero issues or failures... and then on 2nd June it suddenly "stopped working".

Before you ask: there are no changes in data volume, no data quality issues, no known changes to our compute settings, etc. From the logs, it gets about two-thirds of the way through the first notebook and then, based on some back-of-the-envelope maths, somewhere around the 1 hour mark the resources seem to be suddenly and aggressively throttled. Below you can see the largest single transform in the job running perfectly fine, with before and after times:

Then a few cells later we hit the hour mark, resources appear to be rapidly throttled (a 48 second operation explodes to 10 minutes), and it eventually stops working a few steps after this.

There are some chunky merge operations buried in the middle which involve cross-referencing a 400m+ row table with a 25m+ row table to create a cartesian monster on the order of 12bn rows, before netting the product back down to keep just the most recent records. This is unfortunately due to the way the underlying operating system creates and handles its data, and is basically unavoidable (for the purposes of this thread). So in practice my memory load rapidly expands for a brief moment and then shrinks back down again.
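Just to help picture the shape of it, here's a rough sketch (table and column names are made-up stand-ins, not my actual schema): the join briefly fans out towards billions of rows, then gets netted straight back down to the latest record per key.

```sql
-- Illustrative only: agreements / account_movements and the column names are stand-ins.
-- The join briefly fans out towards ~12bn rows...
CREATE OR REPLACE TABLE staging.latest_movements AS
SELECT
  agr.AgreementID,
  mov.AccountID,
  mov.BalanceType,
  mov.Amount,
  mov.EffectiveDate
FROM prod.agreements        AS agr   -- 400m+ rows
JOIN prod.account_movements AS mov   -- 25m+ rows
  ON mov.AgreementKey = agr.AgreementKey
-- ...and is then netted straight back down to the most recent record per agreement/account.
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY agr.AgreementID, mov.AccountID
          ORDER BY mov.EffectiveDate DESC
        ) = 1;
```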
A key point to note is that the cell that works fine before the 1 hour mark is identical code to the later steps that have started to rapidly slow down. They are all scanning the same large table to summarize individual balance types, each writing to its own individual table (1 in, many out). The cell that works fine is by far the largest transform in the notebook.

Worked example below just to help visualise what's going on, as I appreciate the solution may be specific to how QUALIFY behaves versus an rn = ROW_NUMBER() OVER (PARTITION BY ...) followed by an rn = 1 filter in the next step, or something else entirely. I've experimented with pre-sorting Subtotal_Obj or the underlying join table, and/or saving to staging tables, to reduce the sort load from the QUALIFY/PARTITION statement, but my guess from a query execution plan perspective is that even if I know for sure the join is ordered correctly out of the box, the query engine is going to want to verify the ordering before it applies the partition logic anyway. I know the below looks quirky, but the net result is 1:many on agreements:accounts, which then needs reducing to 1:1 in the next step with a SUM over AgreementID.
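Expanding the earlier sketch into that two-step pattern (again, Subtotal_Obj and the column names are invented stand-ins for illustration, not my actual code):

```sql
-- Illustrative sketch only: Subtotal_Obj and the column names are stand-ins.
-- Step 1: rank the 1:many agreement:account rows so we can keep the latest per account.
CREATE OR REPLACE TABLE staging.subtotal_ranked AS
SELECT
  AgreementID,
  AccountID,
  Amount,
  EffectiveDate,
  ROW_NUMBER() OVER (
    PARTITION BY AgreementID, AccountID
    ORDER BY EffectiveDate DESC
  ) AS rn
FROM staging.Subtotal_Obj;

-- Step 2: keep only the latest row per account, then collapse back down to 1:1 per agreement.
CREATE OR REPLACE TABLE staging.agreement_balances AS
SELECT
  AgreementID,
  SUM(Amount) AS TotalBalance
FROM staging.subtotal_ranked
WHERE rn = 1
GROUP BY AgreementID;
```

The QUALIFY version in the sketch above is just these two steps collapsed into one statement; as far as I can tell, either way the engine still has to do the full partition/sort before it can pick out rn = 1.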

I've experimented with simplifying the code (e.g. running batches of AgreementIDs and iterating through the data, roughly as sketched below), but unfortunately the way this transform works makes it a bit unavoidable. To be honest I'm also reluctant to mess around with a working and (at great length!) reconciled dataset which was running fine until a few weeks ago, when the compute fairies came along and changed something.
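For reference, the batching experiment was nothing cleverer than slicing on AgreementID and running the same transform one slice at a time, roughly along these lines (the names and the hash-bucket approach are illustrative, not my exact code):

```sql
-- Illustrative only: same ranking transform, but restricted to one hash bucket of agreements.
-- Swap the bucket number (0..9) between runs rather than looping in Python.
CREATE OR REPLACE TABLE staging.subtotal_ranked_bucket_0 AS
SELECT
  AgreementID,
  AccountID,
  Amount,
  EffectiveDate,
  ROW_NUMBER() OVER (
    PARTITION BY AgreementID, AccountID
    ORDER BY EffectiveDate DESC
  ) AS rn
FROM staging.Subtotal_Obj
WHERE pmod(hash(AgreementID), 10) = 0;   -- bucket 0 of 10
```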
If we look at the peak memory loads, before vs after, there's a massive change:

Lastly... it's worth noting that the query execution steps have changed from 4 to 6 for, again, the exact same query, in a code base that hasn't changed for at least 6 months. Hence my theory that it may be an engine change in the back end somewhere.
I did notice that the "Performance Optimisation" box appeared, defaulted to on. I turned this off and didn't have much luck. This is what makes me suspect there may be back end Databricks changes re: Photon or query execution that have impacted me. The only run of this query I've managed to get working since was pure luck, as no settings or code had changed in between - this is key, as it's not a code base issue and no data has changed, just a very, very small increment on previous weeks. There are no other particular loads running on our compute at the same time, and no compute settings have been changed.
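For anyone who wants to dig into the engine-change theory with me, these are the kind of checks I can run to compare a "good" run against a "bad" one (standard Spark/Databricks SQL as far as I know; the table names below are the illustrative ones from the sketch above):

```sql
-- Capture what runtime/engine version the statement is actually running on, per run.
SELECT current_version();   -- Databricks runtime / DBSQL version
SELECT version();            -- underlying Spark version

-- Dump the physical plan of the problem statement so step counts can be compared run to run.
EXPLAIN FORMATTED
SELECT AgreementID, SUM(Amount) AS TotalBalance
FROM staging.subtotal_ranked   -- illustrative table from the sketch above
WHERE rn = 1
GROUP BY AgreementID;
```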
All suggestions welcome 😁
Tom