We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. It's an all-purpose cluster, and for some reason the client insists we use it for our ETL process. The ETL process itself is simple: the client drops parquet files into blob storage, and a Databricks job scheduled daily reads the files from the blob, saves the content into a hive_metastore table, and moves the parquet files from the blob to an Archive location.
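For context, here's roughly what the daily job does (the storage account, container, paths, and table name below are placeholders, not our real ones):

```
# Simplified sketch of the daily job; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/incoming/"
archive_path = "abfss://landing@<storage-account>.dfs.core.windows.net/archive/"

# Read the new parquet files the client dropped in the blob
df = spark.read.parquet(landing_path)

# Append the content to the hive_metastore table
df.write.mode("append").saveAsTable("hive_metastore.default.my_table")

# Move the processed files to the Archive location
for f in dbutils.fs.ls(landing_path):
    dbutils.fs.mv(f.path, archive_path + f.name)
```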
Currently our biggest table has 66 million rows, and it gets enriched every day. In total we have 7 tables, but recently an issue started popping up: occasionally the pipeline fails, even though we receive a similar amount of data each day, which is what makes it weird. For example, it might fail today and then finish tomorrow without any issues, and fairly fast. The failure message is: `Run failed with error message: Could not reach driver of cluster xxx-xxxxx-xxxx`
The Metrics tab shows 100% memory utilization and nearly 100% CPU. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a tiny table (7 rows). What confuses me is: if the compute has a memory/performance constraint, why does it fail only occasionally and not every time? I tried to free up memory by clearing the cache, but I still get failures from time to time.
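Concretely, the only driver-side collection and the cache clearing look roughly like this (table name is a placeholder):

```
# The only .collect() is on a tiny lookup table, e.g.:
config_rows = spark.table("hive_metastore.default.config_table").collect()  # ~7 rows

# What I tried between tables to free memory:
spark.catalog.clearCache()
```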
Also worth mentioning: the compute is used only by this job, so there are no other workloads running on it.