Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What should I be looking for when evaluating the performance of a Spark job?

User16826992666
Valued Contributor

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

1 ACCEPTED SOLUTION


Srikanth_Gupta_
Valued Contributor

A few things off the top of my mind:

1) Check the Spark UI to see which stages are taking the most time.

2) Check for data skew.

3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use proper join hints (for example, a broadcast hint on the smaller table when joining it with a large table).
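For example, a minimal PySpark sketch of the broadcast hint (the orders and countries table names are hypothetical, just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.table("orders")        # large fact table (hypothetical)
countries = spark.table("countries")  # small lookup table (hypothetical)

# Broadcast the small table to every executor so the large table is
# joined locally instead of being shuffled across the cluster.
joined = orders.join(broadcast(countries), on="country_code", how="left")

# Equivalent hint in Spark SQL:
#   SELECT /*+ BROADCAST(c) */ *
#   FROM orders o JOIN countries c ON o.country_code = c.country_code
```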

4) Check Ganglia metrics (if you are on Databricks) to see cluster resource utilization, and make sure to use the right node type (e.g. memory-optimized vs. compute-optimized instances).

5) Try to avoid UDFs as much as you can; prefer built-in Spark SQL functions.
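To illustrate with a toy sketch: replacing a Python UDF with the equivalent built-in function keeps the work inside the JVM, where the Catalyst optimizer can see it:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: every row is serialized out to a Python worker,
# and the function is a black box to the optimizer.
to_upper = udf(lambda s: s.upper() if s else None, StringType())
slow = df.withColumn("name_upper", to_upper(col("name")))

# Built-in function: same result, stays in the JVM and is optimizable.
fast = df.withColumn("name_upper", upper(col("name")))
```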

6) Spark can use a cost-based optimizer (CBO) to improve query plans. The CBO needs detailed and accurate statistics to plan optimally. Statistics help Spark understand cardinality, data distribution, min/max values and more, which enables it to choose better query execution plans.
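A rough sketch of feeding statistics to the CBO (the sales table and its columns are made up; in open-source Spark the CBO also has to be switched on):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The cost-based optimizer is disabled by default in open-source Spark.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

# Column-level statistics (distinct count, min/max, nulls), used for
# cardinality estimation and join reordering.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```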

7) Use the DBIO cache, which accelerates data reads by caching remote data locally on instance storage.
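A sketch of turning it on at runtime, assuming a Databricks cluster and a made-up Delta path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Databricks-specific setting; it has no effect on open-source Spark.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

df = spark.read.format("delta").load("/mnt/datalake/events")  # hypothetical path

df.count()  # first read pulls data from cloud storage and populates the local cache
df.count()  # repeat reads are served from the instance-local cache
```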

8) Many small files in a data lake can lead to processing overhead; Delta provides the OPTIMIZE command to coalesce small files into larger ones.
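For instance (Delta Lake on Databricks; the events table name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE events")

# Optionally co-locate data that is frequently filtered on the same column,
# which also improves data skipping.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```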

9) Try to avoid count() and collect() actions; use count() only when it is really necessary.
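A small sketch of the idea (toy DataFrame; isEmpty assumes a reasonably recent Spark release):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# collect() pulls every row onto the driver and can easily run it out of memory.
# rows = df.collect()               # avoid on large results

rows = df.limit(20).collect()       # bound what reaches the driver
df.show(5)                          # or just peek at a few rows

# If you only need to know whether any data exists, skip a full count():
has_rows = not df.isEmpty()         # available in recent Spark releases
```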


