Recommendations for performance tuning best practices on Databricks
We also recommend checking out this article from my colleague @Franco Patano on best practices for performance tuning on Databricks.
Performance tuning your workloads is an important step to take before putting your project into production. It helps you get the best performance at the lowest cost, so you save money and meet your SLAs.
When tuning on Databricks, it is important to follow the framework illustrated in the diagram below:
- First, focus on the foundation: the file layout. An efficient file layout is the most important factor for any MPP (massively parallel processing) system because it 1) reduces the overhead from too many small files, 2) reduces or removes data skew, and 3) reduces the amount of data you scan and read into the system.
- Once you have optimized the file layout, you can optimize your code base to remove potential bottlenecks.
- Finally, once you have optimized both the files and the code, you can fine-tune your workload by choosing the optimal cluster configuration for it.
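To make the first step more concrete, here is a minimal sketch of a file-layout check in plain Python. It is not a Databricks API: the function name `layout_report`, the 32 MB small-file threshold, and the sample file sizes are all assumptions made up for illustration. In practice the file sizes would come from your table's actual file listing. The sketch only shows how two of the signals mentioned above, small files and data skew, could be quantified before deciding whether to rework the layout.

```python
from statistics import median

MB = 1024 * 1024
# Assumed threshold for what counts as a "small" file; not a Databricks default.
SMALL_FILE_BYTES = 32 * MB


def layout_report(files_by_partition):
    """Summarize small-file count and partition skew.

    files_by_partition: dict mapping a partition name to a list of
    file sizes in bytes (hypothetical input shape for this sketch).
    """
    if not files_by_partition:
        raise ValueError("no partitions to analyze")

    all_sizes = [size for sizes in files_by_partition.values() for size in sizes]
    partition_bytes = [sum(sizes) for sizes in files_by_partition.values()]
    typical = median(partition_bytes)

    return {
        "file_count": len(all_sizes),
        "small_file_count": sum(1 for size in all_sizes if size < SMALL_FILE_BYTES),
        # Largest partition relative to the median partition; values far
        # above 1 suggest skew.
        "skew_ratio": max(partition_bytes) / typical if typical else float("inf"),
    }


# Made-up sample: one partition of tiny files, one balanced, one oversized.
sample = {
    "2024-01-01": [4 * MB, 6 * MB, 5 * MB],
    "2024-01-02": [128 * MB],
    "2024-01-03": [900 * MB, 850 * MB],
}

report = layout_report(sample)
print(report)
```

A high `small_file_count` points toward compacting files, while a `skew_ratio` well above 1 suggests the partitioning scheme itself may need rethinking. The right thresholds depend on your data volumes and workload.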