A few things off the top of my mind:
1) Check the Spark UI and see which stage is taking the most time.
2) Check for data skew.
3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use the proper join hints, for example a BROADCAST hint on the smaller table when joining it with a large table (see the sketch after this list).
4) If Databricks is used, check the Ganglia metrics to see cluster resource utilization, and make sure to use the right node type (e.g. memory optimized vs. CPU/compute optimized).
5) Try to avoid UDFs as much as we can and prefer built-in functions (see the example after this list).
6) Spark uses a Cost Based Optimizer (CBO) to improve query plans. The CBO relies on detailed and accurate statistics to plan optimally. Statistics help Spark understand cardinality, data distribution, min/max values and more, which enables Spark to choose the optimal query execution plan (statistics can be collected as shown after this list).
7) Use the DBIO cache (Databricks disk cache), which accelerates data reads by caching remote data locally on instance storage (see the configuration sketch after this list).
8) Many small files in a data lake can lead to processing overhead; Delta provides the OPTIMIZE command to coalesce small files (example after this list).
9) Try to avoid count() and collect() actions; use count() only if it is really necessary (see the last sketch after this list).
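
For point 3, here is a minimal sketch of a broadcast join hint, assuming a hypothetical large fact table sales_fact and a small dimension table country_dim:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-hint-sketch").getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
large_df = spark.table("sales_fact")
small_df = spark.table("country_dim")

# Broadcasting the small table ships it to every executor and avoids
# shuffling the large table across the cluster.
joined = large_df.join(broadcast(small_df), on="country_id", how="inner")

# The equivalent hint in Spark SQL:
# SELECT /*+ BROADCAST(d) */ f.*, d.country_name
# FROM sales_fact f JOIN country_dim d ON f.country_id = d.country_id
```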
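
For point 5, a small illustration of replacing a Python UDF with built-in functions (the DataFrame and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("udf-vs-builtin-sketch").getOrCreate()
df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

# A Python UDF like the one below forces rows to be serialized out to Python
# workers and back, and is opaque to the Catalyst optimizer:
#   clean = F.udf(lambda s: s.strip().lower())
#   df = df.withColumn("clean_name", clean("name"))

# The built-in equivalents stay inside the JVM and are optimized by Catalyst:
df = df.withColumn("clean_name", F.lower(F.trim(F.col("name"))))
df.show()
```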
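
For point 6, statistics can be collected with ANALYZE TABLE; a sketch, assuming the hypothetical sales_fact table and the `spark` session from the snippets above:

```python
# Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS")

# Column-level statistics (distinct count, nulls, min/max) for the columns
# used in joins and filters.
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR COLUMNS country_id, amount")

# The CBO is enabled by default in recent Spark releases, but it can be checked:
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
```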
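
For point 7, the disk (DBIO) cache is Databricks-specific; a sketch of enabling it for the current session and pre-warming it for a hypothetical table:

```python
# Databricks-only config; it has no effect on open-source Spark.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-load data that will be read repeatedly into the cache.
spark.sql("CACHE SELECT * FROM sales_fact")
```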
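
For point 8, a sketch of compacting a Delta table with Databricks/Delta SQL (the table and column names are hypothetical):

```python
# Coalesce small files into larger ones.
spark.sql("OPTIMIZE events")

# Optionally co-locate related data on a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```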
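
And for point 9, cheaper alternatives to collect() and count() when you only need a sample or an emptiness check (the DataFrame is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avoid-collect-sketch").getOrCreate()
df = spark.range(1_000_000)

# collect() pulls every row to the driver and can exhaust driver memory;
# take(n) fetches only what is needed.
sample_rows = df.take(10)

# count() scans all partitions just to answer "is it empty?";
# take(1) stops as soon as a single row is found.
is_empty = len(df.take(1)) == 0
```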