I have a DLT pipeline that has been running for weeks. Now, rerunning the pipeline with the same code and the same data fails. I've even tried scaling the cluster's compute up to roughly 3x what was previously working, and it still fails with an out-of-memory error.
Monitoring the Ganglia metrics right before the failure, memory usage on the cluster is just under 40 GB, out of 311 GB total available to the cluster.
I've inherited code that has grown organically over time, so it's not as efficient as it could be, but it was working and now it isn't. What can I do to fix this, or how can I even debug it further to determine the root cause? I'm relatively new to Databricks and this is the first time I've had to debug something like this. I don't know where to start beyond monitoring the logs and metrics.
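For what it's worth, the only concrete thing I've thought of so far is running some sanity checks from an interactive cluster against the same source data, roughly like the sketch below (the table name is just a placeholder, not the real one). Would something like this even be a reasonable starting point, or is there a better place to look?

```python
# Rough sanity checks I'm considering, run on an interactive cluster
# against the same source data the pipeline reads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Dump any memory-related Spark confs actually set on the cluster,
# to compare against what the DLT pipeline cluster is using.
for key, value in spark.sparkContext.getConf().getAll():
    if "memory" in key.lower():
        print(key, "=", value)

# Check whether the source data has grown or become badly partitioned
# since the pipeline last succeeded ("source_table" is a placeholder).
df = spark.read.table("source_table")
print("rows:", df.count())
print("partitions:", df.rdd.getNumPartitions())
```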
Thanks,
bfridley