Re: Performance Tuning Best Practices

isaac_gritz · ‎08-22-2022

Code Optimization - advice to avoid code bottlenecks

Avoid operations that result in Actions such as print, collect, and count in production pipelines. These operations force Spark to execute right away rather than pipelining multiple operations and determining the best query plan.
Avoid using RDDs whenever possible. The RDD API does not benefit from the performance optimizations of the Catalyst optimizer and many of the Databricks built-in runtime performance optimizations.
Avoid single-threaded Python or R workloads on multi-node clusters as these will result in only the driver node being utilized while the worker nodes remain idle. This includes using .toPandas().
1. For working with smaller datasets and single-threaded operations, we highly recommend using single node clusters (AWS | Azure | GCP) for cost optimization.
2. For working with large scale datasets, we recommend leveraging built-in spark functions as much as possible. When this is not possible we recommend using Pandas UDFs (AWS | Azure | GCP) to distribute existing python code using Spark. Alternatively, you can also leverage the Pandas API on Spark (AWS | Azure | GCP) which allows you to leverage 90% of the python Pandas functions while distributing the processing with Spark for scalability.