Non deterministic behavior from the cluster

NotCuriosAtAll — Tue, 27 Jan 2026 15:49:33 GMT

I asked this question a while ago, where I explained the cluster that my team uses on Databricks. To save you some time: we use an all-purpose Standard D2ads v6 cluster with 8 GB of RAM and 2 cores. We are facing a memory issue, which has been pinpointed, BUT the behavior of the cluster is non-deterministic. Every day I receive a similar batch of data, and I use a Databricks job to ingest this data into hive_metastore tables. However, some days it works fine and some days the job crashes with an OOM error during the first step. Sometimes restarting the cluster and re-running the job works like a charm, and it finishes very quickly.

My question, and the thing I'm worried about, is: why does this happen? Every day it is the same amount of data but different cluster behavior (as mentioned, some days it works, some days it doesn't but works well after a restart, and some days the cluster needs a couple of restarts before it starts working properly). The reason I ask is that I have to explain this somehow to my client, and the client isn't eager to spend more money on a more powerful cluster because, well, we can just restart it on the days when the job fails and keep it that way.

Re: Non deterministic behavior from the cluster

MoJaMa — Wed, 28 Jan 2026 02:19:18 GMT

It seems you are submitting a job to an all-purpose cluster. If so, this is an anti-pattern.

The primary reasons:

1. Jobs submitted to AP compute are charged AP rates, typically 2x to 3x job rates in terms of DBUs (for the same cluster spec).

2. There is no way to "prioritize" resources for your important job over (for example) a really expensive query that a developer submits to that same AP cluster before your job starts, which would reduce the amount of resources available to your job.

So, assuming data volume and code are the same, maybe the times it works are the times when "other" processes on the cluster (ad hoc queries, other jobs) are not also demanding resources from the cluster.

When you restart, you are clearing out the memory, and assuming your job is the first one to run this time, it works well (and some other job/process may not).

This is why it has always been the recommendation to use job clusters for jobs.
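To make the job-cluster recommendation concrete, here is a minimal sketch of what a job definition might look like when it asks for its own cluster. It follows the general shape of a Jobs API 2.1 create payload, where a task carries a `new_cluster` block instead of an `existing_cluster_id`. The job name, notebook path, runtime version, node type string, and worker count are all placeholders I made up for illustration, so adjust them to your workspace.

```python
import json


def build_job_payload(notebook_path: str) -> dict:
    """Build a Jobs API 2.1-style payload whose task runs on a job cluster."""
    return {
        "name": "daily-ingest",  # hypothetical job name
        "tasks": [
            {
                "task_key": "ingest",
                "notebook_task": {"notebook_path": notebook_path},
                # "new_cluster" asks for a fresh job cluster on every run,
                # rather than "existing_cluster_id" pointing at the shared
                # all-purpose cluster.
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",  # placeholder runtime
                    "node_type_id": "Standard_D2ads_v6",  # assumed node type string
                    "num_workers": 1,  # placeholder size
                },
            }
        ],
    }


# Hypothetical notebook path, only for illustration.
payload = build_job_payload("/Workspace/ingest/daily")
print(json.dumps(payload, indent=2))
```

Because every run starts on a clean cluster that nothing else shares, the job should see roughly the same memory conditions each day, which is the point of the recommendation.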

Re: Non deterministic behavior from the cluster

pradeep_singh — Wed, 28 Jan 2026 05:43:38 GMT

How to explain it to the client -
The job is operating at the resource ceiling of a very small driver. Tiny, normal day-to-day differences (file layout, plan choice, GC timing) sometimes push it over the limit, which is why restarts occasionally "fix" it: the restart clears memory and changes runtime conditions. This assumes no other workload or query is running on the cluster.

As @MoJaMa suggested, move to a job cluster. Upgrade to the latest DBR if possible. Periodically optimize your target table to compact files, assuming it's in Delta format.
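For the compaction part, a minimal sketch, assuming the target is a Delta table and the code runs somewhere a `spark` session is available (a Databricks notebook or job). The helper name and the table name in the usage comment are hypothetical; `OPTIMIZE` is the Delta SQL command that compacts small files.

```python
def compact_table(spark, table_name: str):
    """Run Delta's OPTIMIZE on a table to compact its small files.

    `spark` is an active SparkSession; `table_name` is the fully
    qualified name of a Delta table.
    """
    return spark.sql(f"OPTIMIZE {table_name}")


# Usage in a notebook or job (hypothetical table name):
# compact_table(spark, "hive_metastore.my_schema.my_table")
```

One option is to run this as a separate scheduled task rather than inside the ingest itself, so the compaction does not compete for memory with the load.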