
What should I be looking for when evaluating the performance of a Spark job?

User16826992666
Valued Contributor

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

1 ACCEPTED SOLUTION


Srikanth_Gupta_
Valued Contributor

A few things off the top of my head:

1) Check the Spark UI to see which stages and tasks are taking the most time.

2) Check for data skew.

3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use the proper join hints (for example, a broadcast hint on the smaller table when joining it with a large one); see the broadcast join sketch after this list.

4) If you are on Databricks, check the Ganglia metrics to see cluster resource utilization, and make sure you pick the right node type (for example, memory-optimized or compute-optimized instances).

5) Avoid UDFs as much as you can; prefer built-in functions, which the optimizer can reason about (see the sketch after this list).

6) Spark uses a cost-based optimizer (CBO) to improve query plans. The CBO has many rule-based optimizations that require detailed and accurate statistics to plan optimally. Statistics help Spark understand cardinality, data distribution, min/max values, and more, which enables it to choose optimal query execution plans (see the ANALYZE TABLE sketch after this list).

7) Use the DBIO cache, which accelerates data reads by caching remote data locally on instance storage (see the sketch after this list).

8) Small files in a data lake can lead to processing overhead; Delta Lake provides the OPTIMIZE command to coalesce small files.

9) Avoid count() and collect() actions where possible; use count() only when it is necessary (see the last sketch after this list).
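
For point 3, here is a minimal PySpark sketch of an explicit broadcast hint. The table and column names (sales_fact, country_dim, country_id) are hypothetical placeholders, not anything from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.table("sales_fact")    # hypothetical large fact table
small_df = spark.table("country_dim")   # hypothetical small lookup table

# Broadcast hint: ship the small table to every executor so the join
# becomes a broadcast hash join instead of a shuffle sort-merge join.
joined = large_df.join(broadcast(small_df), on="country_id", how="inner")

# Equivalent SQL hint form:
# SELECT /*+ BROADCAST(d) */ *
# FROM sales_fact f JOIN country_dim d ON f.country_id = d.country_id
```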
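For point 5, a small sketch of replacing a Python UDF with a built-in column expression; the data and the cents-to-dollars conversion are made up purely for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount")  # hypothetical data

# Instead of a Python UDF such as:
#   to_usd = F.udf(lambda cents: cents / 100.0, "double")
#   df = df.withColumn("usd", to_usd("amount"))
# prefer built-in column expressions, which Catalyst can optimize and
# which avoid Python serialization overhead:
df = df.withColumn("usd", F.col("amount") / 100.0)
df.show(5)
```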
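For point 6, statistics for the CBO are collected with ANALYZE TABLE; a sketch against the same hypothetical table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table- and column-level statistics feed the cost-based optimizer's
# cardinality estimates. Table and column names here are hypothetical.
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR COLUMNS country_id, amount")

# Ensure the cost-based optimizer is enabled (the default varies by runtime).
spark.conf.set("spark.sql.cbo.enabled", "true")
```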
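For points 7 and 8, a sketch of enabling the Databricks disk (DBIO) cache and compacting a Delta table. Both are Databricks/Delta features, not open-source Spark; the table and Z-order column are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Databricks disk cache: keeps remote Parquet/Delta data on local instance
# storage. This setting has no effect outside Databricks.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Compact small files in a Delta table and co-locate rows on a commonly
# filtered column so queries scan fewer files.
spark.sql("OPTIMIZE sales_fact ZORDER BY (country_id)")
```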
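And for point 9, a sketch of pulling only a bounded amount of data to the driver instead of calling collect() on a large result; again the table name is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("sales_fact")  # hypothetical table

# collect() pulls the entire result onto the driver and can OOM it:
# rows = df.collect()           # avoid on large results

# Pull only what you need for inspection:
sample_rows = df.take(10)       # first 10 rows on the driver
df.limit(10).show()             # or display a bounded preview

# If you only need an existence check, avoid a full count():
has_rows = df.limit(1).count() > 0
```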
