Hi there, I've pulled together some thoughts based on experiences.
The short version of what you're hitting with the shuffle differences: the Databricks Runtime is not vanilla Spark with a fancy UI bolted on. It has a load of proprietary optimisations that fundamentally change how workloads behave, particularly shuffle-heavy ones like yours. So when you ask whether the differences are "purely a parallelism thing or whether the runtime itself handles operations differently" โ it's both, but honestly it's mostly the latter.
Specifically, Databricks has its own enhanced Adaptive Query Execution that dynamically coalesces shuffle partitions based on actual data sizes at runtime and converts joins on the fly when it spots skew (
docs). There's Low Shuffle Merge, enabled by default since DBR 10.4, which massively reduces shuffle volume for Delta MERGE operations and simply doesn't exist in vanilla Spark (
guide). If your cluster runs Photon, that's a native C++ execution engine replacing chunks of the JVM layer with completely different memory and shuffle characteristics. And there's custom off-heap memory management and spill behaviour that differs from vanilla Spark's unified memory model. All of which means your local single-node testing is telling you almost nothing useful about how memory pressure will look in production.
Now, that's not to say local testing is pointless โ far from it. For validating transformation logic, testing error handling, getting your code structure right, local hardware is perfect and you're right that it saves meaningful money during the exploratory phase. The mistake is extrapolating from "my code runs correctly locally" to "my code will perform well on Databricks." Those are different questions and local Spark can only answer the first one.
What I'd actually suggest looking at is
Databricks Connect. It lets you keep your local IDE workflow โ write and debug code on your machine โ but executes against an actual Databricks cluster. Pair it with a single-node cluster for dev work and you've got something that's cheap, gives you the real runtime optimisations, and doesn't require you to guess whether your local results will translate. The majority of customers I deal with don't bother with the local testing, the single node cluster is pretty cheap and saves repetition of work.
If you must stick with purely local testing, I'd just be disciplined about separating "does my code produce the right answer" from "how will this perform in production" in your thinking. Local can answer the first reliably. For the second, even a small Databricks cluster will tell you more in ten minutes than three weeks of local testing will.
I hope this gives some ideas.
Thanks,
Emma