Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What should I be looking for when evaluating the performance of a Spark job?

User16826992666
Valued Contributor

Where do I start when performance tuning my queries? Are there particular things I should be looking out for?

1 ACCEPTED SOLUTION


Srikanth_Gupta_
Valued Contributor

A few things off the top of my mind:

1) Check the Spark UI to see which stages are taking the most time.

2) Check for data skew.

3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use proper join hints (for example, a broadcast hint on the smaller table when joining it with a large table).
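For example, a minimal PySpark sketch of the broadcast hint (the orders and countries table names are hypothetical, just for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.table("orders")        # large fact table (hypothetical)
countries = spark.table("countries")  # small lookup table (hypothetical)

# Broadcast the small table to every executor so the large table is
# joined locally instead of being shuffled across the cluster.
joined = orders.join(broadcast(countries), on="country_code", how="left")

# Equivalent hint in Spark SQL:
#   SELECT /*+ BROADCAST(c) */ *
#   FROM orders o JOIN countries c ON o.country_code = c.country_code
```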

4) Check Ganglia metrics (if you are on Databricks) to see cluster resource utilization, and make sure to use the right node type (e.g. memory-optimized vs. compute-optimized instances).

5) Try to avoid UDFs as much as you can; prefer built-in Spark SQL functions.
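To illustrate with a toy sketch: replacing a Python UDF with the equivalent built-in function keeps the work inside the JVM, where the Catalyst optimizer can see it:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: every row is serialized out to a Python worker,
# and the function is a black box to the optimizer.
to_upper = udf(lambda s: s.upper() if s else None, StringType())
slow = df.withColumn("name_upper", to_upper(col("name")))

# Built-in function: same result, stays in the JVM and is optimizable.
fast = df.withColumn("name_upper", upper(col("name")))
```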

6) Spark can use a cost-based optimizer (CBO) to improve query plans. The CBO needs detailed and accurate statistics to plan optimally. Statistics help Spark understand cardinality, data distribution, min/max values and more, which enables it to choose better query execution plans.
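A rough sketch of feeding statistics to the CBO (the sales table and its columns are made up; in open-source Spark the CBO also has to be switched on):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The cost-based optimizer is disabled by default in open-source Spark.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

# Column-level statistics (distinct count, min/max, nulls), used for
# cardinality estimation and join reordering.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```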

7) Use the DBIO cache, which accelerates data reads by caching remote data locally on instance storage.
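A sketch of turning it on at runtime, assuming a Databricks cluster and a made-up Delta path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Databricks-specific setting; it has no effect on open-source Spark.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

df = spark.read.format("delta").load("/mnt/datalake/events")  # hypothetical path

df.count()  # first read pulls data from cloud storage and populates the local cache
df.count()  # repeat reads are served from the instance-local cache
```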

8) Many small files in a data lake can lead to processing overhead; Delta provides the OPTIMIZE command to coalesce small files into larger ones.
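For instance (Delta Lake on Databricks; the events table name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE events")

# Optionally co-locate data that is frequently filtered on the same column,
# which also improves data skipping.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```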

9) Try to avoid count() and collect() actions; use count() only when it is really necessary.
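A small sketch of the idea (toy DataFrame; isEmpty assumes a reasonably recent Spark release):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# collect() pulls every row onto the driver and can easily run it out of memory.
# rows = df.collect()               # avoid on large results

rows = df.limit(20).collect()       # bound what reaches the driver
df.show(5)                          # or just peek at a few rows

# If you only need to know whether any data exists, skip a full count():
has_rows = not df.isEmpty()         # available in recent Spark releases
```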


