Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

User16790091296
Contributor II

Some Tips & Tricks for Optimizing costs and performance (Clusters and Ganglia):

[Note: This list is not exhaustive]

  • Leverage the DataFrame or Spark SQL APIs first. They use the same execution engine, so performance is on par, but they also come with optimizations beyond what RDDs and Datasets offer.
  • Use DataFrames before UDFs. Before building a custom UDF, check whether a built-in function in pyspark.sql.functions already does the job. A UDF forces Spark to deserialize each row, execute the UDF in Python, and reserialize the result (see the first sketch after this list).
  • Leverage MLlib for machine learning. MLlib's models are already optimized for distributed execution. Many Python and R models, like the ones you find on PyPI or CRAN, while advanced, require extra development to run distributed.
  • Cache during ML training. Training ML models and other iterative computations are a good time to explicitly cache the data on the cluster with .cache() (see the caching sketch below). Otherwise, Databricks is optimized to read from object storage and uses the DBIO cache to deliver lightning-fast performance without explicit caching.
  • Cache warming for BI tools. Keep the DBIO cache enabled and pre-cache frequently used tables on clusters that serve results to BI tools (see the cache-warming sketch below).
  • CSV, JSON, raw data to Parquet/Delta. If your data currently lives in CSV or JSON, your first performance improvement is to ETL it into Parquet or Delta (see the ETL sketch below).
  • DataFrame FAQs. The Databricks documentation collects development best practices and FAQs for working with DataFrames.
  • Use Delta tables. Take advantage of out-of-the-box performance and reliability features like data skipping, Z-ordering, and optimized file management without having to implement those optimizations yourself (see the maintenance sketch below).
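
A minimal sketch of the UDF point above, using toy data (the column and values are made up): the built-in pyspark.sql.functions version does the same work as the Python UDF but stays inside the JVM where the optimizer can see it.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf

    spark = SparkSession.builder.getOrCreate()

    # Toy data (hypothetical): a column of names to upper-case.
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # UDF version: every row is serialized out to a Python worker,
    # processed, and serialized back -- the overhead described above.
    upper_udf = udf(lambda s: s.upper() if s else None)
    slow = df.withColumn("name_upper", upper_udf("name"))

    # Built-in version: F.upper runs inside the JVM and is visible to
    # the Catalyst optimizer, so prefer it whenever one exists.
    fast = df.withColumn("name_upper", F.upper(F.col("name")))
    fast.show()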
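
The caching sketch: a rough outline of caching before iterative training, assuming a features table exists (the name ml.features is hypothetical).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical features table; substitute your own.
    features = spark.read.table("ml.features")

    features.cache()
    features.count()  # an action to materialize the cache up front

    # Iterative work (e.g., a hyperparameter sweep) now re-reads the
    # cached data instead of going back to object storage each pass.
    for reg_param in [0.01, 0.1, 1.0]:
        # ... fit a model against `features` here ...
        pass

    features.unpersist()  # free cluster memory when training is done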
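
The cache-warming sketch: Databricks SQL has a CACHE SELECT statement that pulls data into the DBIO/disk cache ahead of BI traffic. The table and column names below are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Warm the cluster's disk cache before BI users arrive; scoping the
    # SELECT to the columns dashboards actually read keeps it cheap.
    spark.sql("""
        CACHE SELECT customer_id, region, total_spend
        FROM analytics.sales_summary
    """)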
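
The ETL sketch: a one-time conversion from raw CSV to Delta; the paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source location in object storage.
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/raw/events/"))

    # Land it once in Delta (or "parquet" if Delta is unavailable);
    # downstream jobs then read the columnar, statistics-backed copy.
    (raw.write
        .format("delta")
        .mode("overwrite")
        .save("/mnt/curated/events/"))

    events = spark.read.format("delta").load("/mnt/curated/events/")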
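
The maintenance sketch: the routine Delta commands behind the data skipping and file management mentioned above. Table and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Compact small files and co-locate rows by the columns most often
    # filtered on, so data skipping can prune more files per query.
    spark.sql("OPTIMIZE curated.events ZORDER BY (event_type, event_date)")

    # Remove data files no longer referenced by the table (the default
    # retention window is 7 days).
    spark.sql("VACUUM curated.events")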
