MadelynM
Databricks Employee

Thanks to everyone who joined the Best Practices for Your Data Architecture session on Optimizing Data Performance. You can access the on-demand session recording here, along with the pre-run performance benchmarks in the Spark UI Simulator.

Proper cluster configuration plays an essential role in optimizing jobs for your data. Whether you're very comfortable with Apache Spark™ or just starting, our experts have best practices to help fine-tune your data pipeline performance. In the session, experts covered:

  • Proven strategies to configure clusters that help you identify and mitigate common data performance problems faced by application and data teams
  • How to reduce data processing time from hours to minutes based on common technical use cases

Posted below is a subset of the questions asked and answered throughout the session. Please feel free to ask follow-up questions or add comments as threads.

Q: What are the most common performance problems?

The "5 Ss" refers to the five most common performance problems that every developer needs to be aware of: Spill, Skew, Shuffle, Storage and Serialization. By developing a solid understanding of these problems, every developer is better equipped to diagnose and fix various performance problems.
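To make one of the "5 Ss" concrete, here is a minimal pure-Python sketch (not Spark API, and not from the session) of "salting", a common mitigation for skew: a single hot key is split across N sub-keys so no one shuffle partition receives all of its rows. The key distribution and bucket count are hypothetical.

```python
from collections import Counter

# Hypothetical skewed distribution: one key holds 90% of the rows.
keys = ["hot"] * 90 + ["a"] * 5 + ["b"] * 5

N = 10  # number of salt buckets (a tuning choice)
# Deterministic salt for illustration; in practice a random or
# hash-based salt column is appended to the join/group key.
salted = [f"{k}_{i % N}" for i, k in enumerate(keys)]

plain_counts = Counter(keys)
salted_counts = Counter(salted)

print(max(plain_counts.values()))   # 90 rows land on a single key
print(max(salted_counts.values()))  # 9  rows per salted sub-key
```

In real Spark code you would add the salt as a column, join or aggregate on the salted key, and then aggregate once more to remove the salt. Note that Adaptive Query Execution can handle many skewed joins automatically.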

Q: How much Spark do I really need to know?

Much less than you used to! Proper cluster configuration, VM selection, memory allocation, compute levels, and general topology play as important a role in optimizing an Apache Spark™ job as any other topic. Depending on your specific job requirements, and weighing many other factors, there can also be significant performance benefits to running on a Delta cluster versus a plain Spark cluster.
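As a hedged illustration of the kinds of knobs involved (these are common starting points, not the session's specific recommendations), a few cluster-level Spark settings frequently tuned on Databricks, assuming a notebook where `spark` is the pre-created SparkSession:

```python
# Assumes a Databricks notebook, where `spark` is already defined.

# Adaptive Query Execution re-optimizes plans at runtime using
# shuffle statistics, and can automatically split skewed joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Shuffle partition count is a frequent tuning knob; the default of
# 200 is rarely right for very small or very large jobs.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Cap how much data each input partition reads (128 MB default).
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
```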

Q: If a crash occurs, is it possible to get to the source code so we can work out what might be going wrong? 

If your cluster is logging, you can review the logs to find the errors, which usually include the explain plan. Otherwise, go to the cluster where you ran your query and click the Spark UI. If it's a DataFrame query, you can see the plan on the SQL tab of the corresponding job, at the bottom under "Details."

Q: When is my performance optimized "enough"?

The one optimization applicable to nearly every Spark job is reducing the amount of data ingested. This session explored key ingestion concepts, including file formats, data formats, and data storage strategies, and how they can all work together to maximize a job's performance.
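As a toy, pure-Python illustration (not Spark API) of why reducing ingestion pays off: columnar formats such as Parquet and Delta let the reader skip unneeded columns entirely, so selecting only required columns cuts the bytes scanned. The table contents below are hypothetical.

```python
# Hypothetical wide table: a large "payload" column the job never uses.
table = {
    "id":      [str(i) for i in range(1000)],
    "name":    [f"name_{i}" for i in range(1000)],
    "payload": ["x" * 200 for _ in range(1000)],  # large, unused column
}

def bytes_read(columns):
    """Simulate a columnar scan that touches only the listed columns."""
    return sum(len(v) for c in columns for v in table[c])

full = bytes_read(["id", "name", "payload"])   # SELECT * style scan
pruned = bytes_read(["id", "name"])            # only what the job needs

print(pruned < full)      # True: most bytes were in "payload"
print(round(pruned / full, 3))  # small fraction of the full scan
```

In Spark the same effect comes from selecting columns early (e.g. `df.select("id", "name")`) and filtering on partition columns, so pruning and predicate pushdown happen at the storage layer.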

Add your follow-up questions to threads!

For more, check out the Databricks Academy Self-Paced Course on Optimizing Apache Spark™ on Databricks included in the Databricks Academy Free Customer Learning.

