Training models on big or small clusters

akc

I have a workflow with a model that trains every Sunday in Azure Databricks. Sometimes the workflow fails because the max wait time is exceeded (I am currently using 1200 seconds). To solve the problem, I was thinking of either increasing the wait time or increasing the size of the cluster.

This made me wonder which of the two options below is better (and cheaper):

  1. Train the model on a bigger and more expensive cluster, which will hopefully reduce the training time
  2. Train the model on a smaller and cheaper cluster and simply increase the wait time

Or is there a third and better solution?
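
To make the comparison concrete, here is a rough Python sketch of how I would estimate the cost of each option. Everything in it is an assumption for illustration: the node counts, the flat `price_per_node_hour` (taken to bundle the VM and DBU charges), and the runtimes are made-up numbers, not measurements from my workflow. The idea is only that a run costs roughly nodes × hourly price × runtime, so the bigger cluster is cheaper only if the speedup exceeds the increase in hourly price.

```python
def run_cost(nodes: int, price_per_node_hour: float, runtime_seconds: float) -> float:
    """Approximate cost of one training run on a fixed-size cluster.

    price_per_node_hour is a hypothetical all-in price per node (VM + DBU).
    """
    return nodes * price_per_node_hour * runtime_seconds / 3600.0


def break_even_speedup(small_nodes: int, small_price: float,
                       big_nodes: int, big_price: float) -> float:
    """Speedup the bigger cluster must exceed to be cheaper per run."""
    return (big_nodes * big_price) / (small_nodes * small_price)


# Made-up example: the small cluster needs 1500 s (over the 1200 s limit),
# the bigger cluster is assumed to finish in 600 s.
small = run_cost(nodes=2, price_per_node_hour=0.50, runtime_seconds=1500)
big = run_cost(nodes=4, price_per_node_hour=0.50, runtime_seconds=600)

needed = break_even_speedup(2, 0.50, 4, 0.50)  # hourly price ratio
actual = 1500 / 600                            # assumed speedup

print(f"small cluster: {small:.3f} per run")
print(f"big cluster:   {big:.3f} per run")
print(f"bigger cluster is cheaper: {actual > needed}")
```

With these invented numbers the bigger cluster wins, but if training does not scale well with more nodes, the speedup can fall below the price ratio and the smaller cluster with a longer wait time becomes the cheaper choice.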