Best Practices for Your Data Architecture: Optimizing Data Performance

MadelynM
New Contributor III

Thanks to everyone who joined the Best Practices for Your Data Architecture session on Optimizing Data Performance. You can access the on-demand session recording here, along with the pre-run performance benchmarks, using the Spark UI Simulator.

Proper cluster configuration plays an essential role in optimizing jobs for your data. Whether you're very comfortable with Apache Spark™ or just starting, our experts have best practices to help fine-tune your data pipeline performance. In the session, experts covered:

  • Proven strategies to configure clusters that help you identify and mitigate common data performance problems faced by application and data teams
  • How to reduce data processing time from hours to minutes based on common technical use cases

Posted below is a subset of the questions asked and answered throughout the session. Please feel free to ask follow-up questions or add comments as threads.

Q: What are the most common performance problems?

The "5 Ss" refers to the five most common performance problems that every developer needs to be aware of: Spill, Skew, Shuffle, Storage and Serialization. By developing a solid understanding of these problems, every developer is better equipped to diagnose and fix various performance problems.
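To make one of the 5 Ss concrete: skew means a handful of key values hold most of the rows, so one task ends up doing most of the work. The plain-Python sketch below uses made-up counts to show the kind of distribution check involved; in Spark you would typically inspect the output of `df.groupBy(key).count()` the same way:

```python
from collections import Counter

# Hypothetical join-key values: "us" dominates the dataset, which in
# Spark would leave one task processing the bulk of the rows.
keys = ["us"] * 9_000 + ["de"] * 500 + ["fr"] * 500

counts = Counter(keys)
total = sum(counts.values())

# Flag any key carrying more than half the rows as a skew suspect.
skewed = {k: c for k, c in counts.items() if c / total > 0.5}
print(skewed)  # {'us': 9000}
```

Once a skewed key is identified, common mitigations include salting the key or enabling adaptive query execution so Spark can split oversized partitions.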

Q: How much Spark do I really need to know?

Much less than you used to! Proper cluster configuration, VM selection, memory allocation, compute levels, and general topology can play as important a role in optimizing an Apache Spark™ job as any other factor. Depending on specific job requirements, there can also be significant performance benefits to running on a Delta-enabled cluster versus a plain Spark cluster, though many other factors come into play.
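As a rough illustration of the knobs involved, a Databricks cluster definition (Clusters API JSON) touches several of them at once. All values below are illustrative placeholders, not recommendations:

```json
{
  "cluster_name": "etl-pipeline",
  "spark_version": "10.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "spark_conf": {
    "spark.sql.shuffle.partitions": "200"
  }
}
```

Here `node_type_id` is the VM selection, `autoscale` sets the compute level, and `spark_conf` entries tune memory and shuffle behavior for the workload.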

Q: If a crash occurs, is it possible to get to the source code so we can work out what might be going wrong? 

If your cluster is logging, you can look at the logs to find the errors; the errors usually include the explain plan. Otherwise, go to the cluster where you ran your query and open the Spark UI. If it's a DataFrame query, you can see the plan on the SQL tab of the corresponding job, at the bottom under "details."

Q: When is my performance optimized "enough"?

One optimization applicable to nearly every Spark job is reducing data ingestion. The session explored key ingestion concepts, including file formats, data formats, and data storage strategies, and how they can all work together to maximize a job's performance.

Add your follow-up questions to threads!

For more, check out the Databricks Academy Self-Paced Course on Optimizing Apache Spark™ on Databricks included in the Databricks Academy Free Customer Learning.
