Data Engineering
User16790091296
Contributor II

Some Tips & Tricks for Optimizing Costs and Performance (Clusters and Ganglia):

[Note: This list is not exhaustive]

  • Leverage the DataFrame or Spark SQL APIs first. They share the same execution engine, so performance is on par, and they come with optimizations beyond what RDDs and Datasets offer.
  • Use DataFrames before UDFs. Before building a custom UDF, check the built-in pyspark.sql.functions (here). UDFs force Spark to deserialize each row, execute the UDF in Python, then reserialize the results. Documentation.
  • Leverage MLlib for machine learning. MLlib's models (here) are already optimized for distributed execution. Many Python and R models, like the ones you find on PyPI or CRAN, are advanced but require extra development work to run distributed.
  • Cache during ML training. Training ML models and other iterative computation are good times to explicitly cache data on the cluster with cache(). Otherwise, Databricks is optimized to read from object storage and uses the DBIO cache (link) for fast performance without explicit caching.
  • Cache warming for BI tools. With the DBIO cache enabled, you can also pre-warm frequently used tables on clusters that serve results to BI tools.
  • CSV, JSON, raw data to Parquet/Delta. If your data currently lives as CSV or JSON, your first performance improvement will be to ETL it into Parquet/Delta.
  • DataFrame FAQs. Here are some helpful development best practices and FAQs for working with DataFrames. Link.
  • Use Delta tables. Take advantage of out-of-the-box performance and reliability features like data skipping, Z-ordering, and optimized file management, without having to implement those optimizations yourself.
