Once you've selected a cluster that seems reasonable, run your workload and check the Ganglia metrics to see whether you need a compute-, memory-, or storage-optimized cluster, then iterate from there.
To simply verify that your code works, it's best practice to start with a small sample of data on a single-node cluster.
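To illustrate the "check Ganglia metrics, then iterate" step, here is a minimal sketch of how you might translate average utilization figures from the Ganglia UI into a cluster-family choice. The function name and the thresholds are my own assumptions for the example, not Databricks recommendations:

```python
# Hypothetical helper: map average utilization metrics (read off the
# Ganglia UI) to a cluster family. Thresholds are illustrative only.
def suggest_cluster_family(cpu_pct: float, mem_pct: float, io_wait_pct: float) -> str:
    if io_wait_pct > 30:   # executors mostly waiting on disk/object storage
        return "storage-optimized"
    if mem_pct > 80:       # heavy caching or shuffle pressure
        return "memory-optimized"
    if cpu_pct > 80:       # CPU-bound transformations or UDFs
        return "compute-optimized"
    return "general-purpose"

# Example: high CPU, moderate memory, little I/O wait
print(suggest_cluster_family(cpu_pct=85, mem_pct=40, io_wait_pct=5))
# compute-optimized
```

The point is just to make the iteration loop concrete: measure, classify the bottleneck, switch cluster family, and measure again.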
Personally, once my data processing is optimized, I benchmark different setups to find the one that meets my processing-time goal for the fewest DBUs.