I asked a question a while ago where I explained the cluster my team uses on Databricks. To save you some time: it's an all-purpose Standard D2ads v6 cluster with 8 GB of RAM and 2 cores. We are facing a memory issue that we have pinpointed, BUT the cluster's behavior is non-deterministic. Every day I receive a similar batch of data and use a Databricks job to ingest it into hive_metastore tables. Some days it works fine, and some days the job crashes with an OOM error during the first step. Sometimes restarting the cluster and re-running the job works like a charm and finishes quickly.
My question, and the thing I'm worried about, is: why does this happen? Every day it's the same amount of data, but the cluster behaves differently (as mentioned, some days it works, some days it fails but runs fine after a restart, and some days the cluster needs a couple of restarts before it works properly). I ask because I have to explain this somehow to my client, and the client isn't eager to spend money on a more powerful cluster because, well, we can just restart it on the days when the job fails and leave it at that.