MadelynM
Databricks Employee

Thanks to everyone who joined the Best Practices for Your Data Architecture session on Optimizing Data Performance. You can access the on-demand session recording here, along with the pre-run performance benchmarks in the Spark UI Simulator.

Proper cluster configuration plays an essential role in optimizing jobs for your data. Whether you're very comfortable with Apache Spark™ or just starting, our experts have best practices to help fine-tune your data pipeline performance. In the session, experts covered:

  • Proven strategies to configure clusters that help you identify and mitigate common data performance problems faced by application and data teams
  • How to reduce data processing time from hours to minutes based on common technical use cases

Posted below is a subset of the questions asked and answered throughout the session. Please feel free to ask follow-up questions or add comments as threads.

Q: What are the most common performance problems?

The "5 Ss" refers to the five most common performance problems that every developer needs to be aware of: Spill, Skew, Shuffle, Storage and Serialization. By developing a solid understanding of these problems, every developer is better equipped to diagnose and fix various performance problems.
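To make one of the "5 Ss" concrete, here is a minimal pure-Python sketch (not Spark API, and not from the session) of "salting", a common mitigation for skew: a single hot key is split across N sub-keys so no one shuffle partition receives all of its rows. The key distribution and bucket count are hypothetical.

```python
from collections import Counter

# Hypothetical skewed distribution: one key holds 90% of the rows.
keys = ["hot"] * 90 + ["a"] * 5 + ["b"] * 5

N = 10  # number of salt buckets (a tuning choice)
# Deterministic salt for illustration; in practice a random or
# hash-based salt column is appended to the join/group key.
salted = [f"{k}_{i % N}" for i, k in enumerate(keys)]

plain_counts = Counter(keys)
salted_counts = Counter(salted)

print(max(plain_counts.values()))   # 90 rows land on a single key
print(max(salted_counts.values()))  # 9  rows per salted sub-key
```

In real Spark code you would add the salt as a column, join or aggregate on the salted key, and then aggregate once more to remove the salt. Note that Adaptive Query Execution can handle many skewed joins automatically.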

Q: How much Spark do I really need to know?

Much less than you used to! Proper cluster configuration, VM selection, memory allocation, compute levels, and general topology play as important a role in optimizing an Apache Spark™ job as any other topic. Depending on your specific job requirements, and weighing many other factors, there can also be significant performance benefits to running on a Delta cluster versus a plain Spark cluster.
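As a hedged illustration of the kinds of knobs involved (these are common starting points, not the session's specific recommendations), a few cluster-level Spark settings frequently tuned on Databricks, assuming a notebook where `spark` is the pre-created SparkSession:

```python
# Assumes a Databricks notebook, where `spark` is already defined.

# Adaptive Query Execution re-optimizes plans at runtime using
# shuffle statistics, and can automatically split skewed joins.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Shuffle partition count is a frequent tuning knob; the default of
# 200 is rarely right for very small or very large jobs.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Cap how much data each input partition reads (128 MB default).
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
```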

Q: If a crash occurs, is it possible to get to the source code so we can work out what might be going wrong? 

If your cluster is logging, you can review the logs to find the errors, which usually include the explain plan. Otherwise, go to the cluster where you ran your query and click the Spark UI. If it's a DataFrame query, you can see the plan on the SQL tab of the corresponding job, at the bottom under "Details."

Q: When is my performance optimized "enough"?

The one optimization applicable to nearly every Spark job is reducing the amount of data ingested. This session explored key ingestion concepts, including file formats, data formats, and data storage strategies, and how they can all work together to maximize a job's performance.
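As a toy, pure-Python illustration (not Spark API) of why reducing ingestion pays off: columnar formats such as Parquet and Delta let the reader skip unneeded columns entirely, so selecting only required columns cuts the bytes scanned. The table contents below are hypothetical.

```python
# Hypothetical wide table: a large "payload" column the job never uses.
table = {
    "id":      [str(i) for i in range(1000)],
    "name":    [f"name_{i}" for i in range(1000)],
    "payload": ["x" * 200 for _ in range(1000)],  # large, unused column
}

def bytes_read(columns):
    """Simulate a columnar scan that touches only the listed columns."""
    return sum(len(v) for c in columns for v in table[c])

full = bytes_read(["id", "name", "payload"])   # SELECT * style scan
pruned = bytes_read(["id", "name"])            # only what the job needs

print(pruned < full)      # True: most bytes were in "payload"
print(round(pruned / full, 3))  # small fraction of the full scan
```

In Spark the same effect comes from selecting columns early (e.g. `df.select("id", "name")`) and filtering on partition columns, so pruning and predicate pushdown happen at the storage layer.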

Add your follow-up questions to threads!

For more, check out the Databricks Academy Self-Paced Course on Optimizing Apache Spark™ on Databricks included in the Databricks Academy Free Customer Learning.

