Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
User16790091296
Contributor II

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia):

[Note: This list is not exhaustive]

  • Leverage the DataFrame or Spark SQL APIs first. They share the same execution engine, so performance is at parity, and they come with optimizations that go beyond what RDDs and Datasets offer.
  • Prefer DataFrames over UDFs. Before building a custom UDF, check whether pyspark.sql.functions (here) already covers your case. A Python UDF forces Spark to deserialize each row, execute the UDF row by row in a Python process, then reserialize the results. Documentation.
  • Leverage MLlib for machine learning. MLlib's models (here) are already optimized for distributed execution. Many Python and R models, like the ones you find on PyPI or CRAN, while advanced, require extra development to run distributed.
  • Cache during ML training. Training ML models and other iterative computations are good times to explicitly cache the data on the cluster using cache(). Otherwise, Databricks is optimized to read from object storage and leverages the DBIO cache (link) for lightning-fast performance without explicit caching.
  • Cache warming for BI tools. With the DBIO cache enabled, you can also pre-cache frequently used tables on clusters that will be serving results to BI tools.
  • CSV, JSON, raw data to Parquet/Delta. If your data currently lives as CSV or JSON, your first performance improvement will come from ETLing it into Parquet/Delta.
  • DataFrame FAQs. Here are some helpful development best practices and FAQs for working with DataFrames. Link.
  • Use Delta tables. Take advantage of out-of-the-box performance and reliability features like data skipping, Z-ordering, and optimized file management without having to implement those optimizations yourself.
