It seems you are submitting a job to an all-purpose cluster. If so, this is an anti-pattern.
The primary reasons:
1. Jobs submitted to all-purpose (AP) compute are charged at AP rates, typically 2x to 3x the jobs compute rate in DBU terms for the same cluster spec.
2. There is no way to prioritize resources for your important job over other work on the same AP cluster. For example, a developer may submit a really expensive query just before your job starts, which reduces the resources available to your job.
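To put the first point in numbers, here is a rough sketch of the cost difference. The per-DBU rates, DBU consumption, and runtime below are made-up placeholders (assumptions for illustration, not real Databricks pricing); only the roughly 3x ratio reflects the range mentioned above.

```python
# Illustrative only: all numbers are hypothetical placeholders, not real pricing.

def run_cost(dbus_per_hour: float, hours: float, rate_per_dbu: float) -> float:
    """Cost of one run = DBUs consumed per hour * hours * price per DBU."""
    return dbus_per_hour * hours * rate_per_dbu

JOBS_RATE = 0.15         # hypothetical $/DBU on jobs compute
ALL_PURPOSE_RATE = 0.45  # hypothetical $/DBU on all-purpose compute (3x here)

# Same cluster spec and same runtime, so the DBU consumption is identical.
job_cost = run_cost(dbus_per_hour=20, hours=2, rate_per_dbu=JOBS_RATE)
ap_cost = run_cost(dbus_per_hour=20, hours=2, rate_per_dbu=ALL_PURPOSE_RATE)

print(f"jobs compute: ${job_cost:.2f}")
print(f"all-purpose:  ${ap_cost:.2f} ({ap_cost / job_cost:.1f}x)")
```

Plug in the rates from your own contract or pricing page to get a real comparison.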
Assuming the data volume and code are the same, the times it works are probably the times when other processes on the cluster (ad hoc queries, other jobs) are not also demanding resources.
When you restart the cluster, you clear out its memory. Assuming your job is the first one to run afterwards, it works well (and some other job or process may be the one that suffers instead).
This is why it has always been the recommendation to use job clusters for jobs.
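As a minimal sketch of what that looks like in practice, the job task defines its own `new_cluster` instead of pointing at a shared cluster through `existing_cluster_id`. The field names follow the Jobs API 2.1 job settings format; the job name, notebook path, runtime version, node type, and worker count are placeholders I made up, so swap in values that exist in your workspace.

```python
import json

def job_settings(notebook_path: str, spark_version: str,
                 node_type_id: str, num_workers: int) -> dict:
    """Build job settings whose single task runs on a dedicated job cluster."""
    return {
        "name": "nightly-etl",  # hypothetical job name
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
                # A job cluster: created for this run and terminated afterwards,
                # so nothing else competes for its resources.
                # (The anti-pattern would be "existing_cluster_id": "<AP cluster id>".)
                "new_cluster": {
                    "spark_version": spark_version,
                    "node_type_id": node_type_id,
                    "num_workers": num_workers,
                },
            }
        ],
    }

# Placeholder values, assumed for illustration only.
settings = job_settings(
    notebook_path="/Repos/team/etl/main",
    spark_version="13.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=4,
)
print(json.dumps(settings, indent=2))
```

The resulting JSON can be used as the body for creating or updating the job, or mirrored in the job's compute settings in the UI by choosing a job cluster instead of an existing all-purpose cluster.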