Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

User16790091296
Contributor II

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia):

[Note: This list is not exhaustive]

  • Leverage the DataFrame or Spark SQL APIs first. They use the same execution engine, so performance is on par, but they also come with optimizations beyond what RDDs and Datasets offer.
  • Use DataFrames before UDFs. Before building a custom UDF, check whether a built-in function in pyspark.sql.functions already does the job. A UDF forces Spark to deserialize each row, execute the UDF in Python, and reserialize the result (see the first sketch after this list).
  • Leverage MLlib for machine learning. MLlib's models are already optimized for distributed execution. Many Python and R models, like the ones you find on PyPI or CRAN, while advanced, require extra development to run distributed.
  • Cache during ML training. Training ML models and other iterative computations are a good time to explicitly cache the data on the cluster with .cache() (see the caching sketch below). Otherwise, Databricks is optimized to read from object storage and uses the DBIO cache to deliver lightning-fast performance without explicit caching.
  • Cache warming for BI tools. Keep the DBIO cache enabled and pre-cache frequently used tables on clusters that serve results to BI tools (see the cache-warming sketch below).
  • CSV, JSON, raw data to Parquet/Delta. If your data currently lives in CSV or JSON, your first performance improvement is to ETL it into Parquet or Delta (see the ETL sketch below).
  • DataFrame FAQs. The Databricks documentation collects development best practices and FAQs for working with DataFrames.
  • Use Delta tables. Take advantage of out-of-the-box performance and reliability features like data skipping, Z-ordering, and optimized file management without having to implement those optimizations yourself (see the maintenance sketch below).
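
A minimal sketch of the UDF point above, using toy data (the column and values are made up): the built-in pyspark.sql.functions version does the same work as the Python UDF but stays inside the JVM where the optimizer can see it.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf

    spark = SparkSession.builder.getOrCreate()

    # Toy data (hypothetical): a column of names to upper-case.
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # UDF version: every row is serialized out to a Python worker,
    # processed, and serialized back -- the overhead described above.
    upper_udf = udf(lambda s: s.upper() if s else None)
    slow = df.withColumn("name_upper", upper_udf("name"))

    # Built-in version: F.upper runs inside the JVM and is visible to
    # the Catalyst optimizer, so prefer it whenever one exists.
    fast = df.withColumn("name_upper", F.upper(F.col("name")))
    fast.show()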
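
The caching sketch: a rough outline of caching before iterative training, assuming a features table exists (the name ml.features is hypothetical).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical features table; substitute your own.
    features = spark.read.table("ml.features")

    features.cache()
    features.count()  # an action to materialize the cache up front

    # Iterative work (e.g., a hyperparameter sweep) now re-reads the
    # cached data instead of going back to object storage each pass.
    for reg_param in [0.01, 0.1, 1.0]:
        # ... fit a model against `features` here ...
        pass

    features.unpersist()  # free cluster memory when training is done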
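
The cache-warming sketch: Databricks SQL has a CACHE SELECT statement that pulls data into the DBIO/disk cache ahead of BI traffic. The table and column names below are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Warm the cluster's disk cache before BI users arrive; scoping the
    # SELECT to the columns dashboards actually read keeps it cheap.
    spark.sql("""
        CACHE SELECT customer_id, region, total_spend
        FROM analytics.sales_summary
    """)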
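
The ETL sketch: a one-time conversion from raw CSV to Delta; the paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source location in object storage.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/raw/events/"))

    # Land it once in Delta (or "parquet" if Delta is unavailable);
    # downstream jobs then read the columnar, statistics-backed copy.
    (raw.write
        .format("delta")
        .mode("overwrite")
        .save("/mnt/curated/events/"))

    events = spark.read.format("delta").load("/mnt/curated/events/")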
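
The maintenance sketch: the routine Delta commands behind the data skipping and file management mentioned above. Table and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Compact small files and co-locate rows by the columns most often
    # filtered on, so data skipping can prune more files per query.
    spark.sql("OPTIMIZE curated.events ZORDER BY (event_type, event_date)")

    # Remove data files no longer referenced by the table (the default
    # retention window is 7 days).
    spark.sql("VACUUM curated.events")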
