Once you've selected a cluster that seems reasonable, run your workload and check the Ganglia metrics to see whether you need a compute-, memory-, or storage-optimized cluster, then iterate from there.
To simply verify that your code works, it's best practice to start with a small sample of data on a single-node cluster.
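To illustrate the "check Ganglia metrics, then iterate" step, here is a minimal sketch of how you might translate average utilization figures from the Ganglia UI into a cluster-family choice. The function name and the thresholds are my own assumptions for the example, not Databricks recommendations:

```python
# Hypothetical helper: map average utilization metrics (read off the
# Ganglia UI) to a cluster family. Thresholds are illustrative only.
def suggest_cluster_family(cpu_pct: float, mem_pct: float, io_wait_pct: float) -> str:
    if io_wait_pct > 30:   # executors mostly waiting on disk/object storage
        return "storage-optimized"
    if mem_pct > 80:       # heavy caching or shuffle pressure
        return "memory-optimized"
    if cpu_pct > 80:       # CPU-bound transformations or UDFs
        return "compute-optimized"
    return "general-purpose"

# Example: high CPU, moderate memory, little I/O wait
print(suggest_cluster_family(cpu_pct=85, mem_pct=40, io_wait_pct=5))
# compute-optimized
```

The point is just to make the iteration loop concrete: measure, classify the bottleneck, switch cluster family, and measure again.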
Personally, once my data processing is optimized, I benchmark different setups to find the one that meets my processing-time goal for the fewest DBUs.