A few things off the top of my mind:
1) Check the Spark UI and see which stage is taking the most time.
2) Check for data skew.
3) Data skew can severely degrade query performance. Spark SQL accepts skew hints in queries; also make sure to use the proper join hints, for example a BROADCAST hint on the smaller table when joining it with a large table (see the sketch after this list).
4) If Databricks is used, check the Ganglia metrics to see cluster resource utilization, and make sure to use the right node type (e.g. memory optimized vs. CPU/compute optimized).
5) Try to avoid UDFs as much as we can and prefer built-in functions (see the example after this list).
6) Spark uses a Cost Based Optimizer (CBO) to improve query plans. The CBO relies on detailed and accurate statistics to plan optimally. Statistics help Spark understand cardinality, data distribution, min/max values and more, which enables Spark to choose the optimal query execution plan (statistics can be collected as shown after this list).
7) Use the DBIO cache (Databricks disk cache), which accelerates data reads by caching remote data locally on instance storage (see the configuration sketch after this list).
8) Many small files in a data lake can lead to processing overhead; Delta provides the OPTIMIZE command to coalesce small files (example after this list).
9) Try to avoid count() and collect() actions; use count() only if it is really necessary (see the last sketch after this list).
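
For point 3, here is a minimal sketch of a broadcast join hint, assuming a hypothetical large fact table sales_fact and a small dimension table country_dim:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-hint-sketch").getOrCreate()

# Hypothetical tables: a large fact table and a small dimension table.
large_df = spark.table("sales_fact")
small_df = spark.table("country_dim")

# Broadcasting the small table ships it to every executor and avoids
# shuffling the large table across the cluster.
joined = large_df.join(broadcast(small_df), on="country_id", how="inner")

# The equivalent hint in Spark SQL:
# SELECT /*+ BROADCAST(d) */ f.*, d.country_name
# FROM sales_fact f JOIN country_dim d ON f.country_id = d.country_id
```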
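
For point 5, a small illustration of replacing a Python UDF with built-in functions (the DataFrame and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("udf-vs-builtin-sketch").getOrCreate()
df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

# A Python UDF like the one below forces rows to be serialized out to Python
# workers and back, and is opaque to the Catalyst optimizer:
#   clean = F.udf(lambda s: s.strip().lower())
#   df = df.withColumn("clean_name", clean("name"))

# The built-in equivalents stay inside the JVM and are optimized by Catalyst:
df = df.withColumn("clean_name", F.lower(F.trim(F.col("name"))))
df.show()
```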
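
For point 6, statistics can be collected with ANALYZE TABLE; a sketch, assuming the hypothetical sales_fact table and the `spark` session from the snippets above:

```python
# Table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS")

# Column-level statistics (distinct count, nulls, min/max) for the columns
# used in joins and filters.
spark.sql("ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR COLUMNS country_id, amount")

# The CBO is enabled by default in recent Spark releases, but it can be checked:
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
```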
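
For point 7, the disk (DBIO) cache is Databricks-specific; a sketch of enabling it for the current session and pre-warming it for a hypothetical table:

```python
# Databricks-only config; it has no effect on open-source Spark.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-load data that will be read repeatedly into the cache.
spark.sql("CACHE SELECT * FROM sales_fact")
```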
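
For point 8, a sketch of compacting a Delta table with Databricks/Delta SQL (the table and column names are hypothetical):

```python
# Coalesce small files into larger ones.
spark.sql("OPTIMIZE events")

# Optionally co-locate related data on a frequently filtered column.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```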
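
And for point 9, cheaper alternatives to collect() and count() when you only need a sample or an emptiness check (the DataFrame is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avoid-collect-sketch").getOrCreate()
df = spark.range(1_000_000)

# collect() pulls every row to the driver and can exhaust driver memory;
# take(n) fetches only what is needed.
sample_rows = df.take(10)

# count() scans all partitions just to answer "is it empty?";
# take(1) stops as soon as a single row is found.
is_empty = len(df.take(1)) == 0
```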