What are the best resources for learning how to tune/optimize Spark?

gaponte
New Contributor III

I know this question/topic is not very specific, but perhaps asking it would be useful for people other than me.

I am a newbie to Spark, and while I've been able to get my current model training and data transformations running, they are taking awfully long, and there are conditions that feel symptomatic of Spark not (yet) being properly optimized (by me) for what I'm doing (e.g. oftentimes there are executors sitting idle, often the last few tasks take forever compared to the first 99%, and other assorted issues).

Where is the best place to go to learn how to diagnose and fix Spark performance issues? I'm relatively confident that what I'm experiencing is not related to Databricks and based on my preliminary research it seems like Spark's performance can vary a LOT depending on whether it's been tuned properly for the use case at hand; I just don't know what the best/fastest approach is to becoming a Spark whisperer :-).

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Greg Aponte: There is no shortcut to becoming a Spark expert; it takes a lot of dedication and hands-on work. I would recommend studying the different join strategies, such as broadcast join, shuffle hash join, and sort merge join. Essentially, you want as few shuffles as possible, and to achieve that you should learn the concepts of filtering, repartition, and coalesce. This can come in handy as well. Also, look for the many summit talks on YouTube by Bricksters, where they explain how to optimize Spark code. Finally, learn what every Spark execution parameter means so that you can tweak it to best optimize your code.
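
To make the repartition point a bit more concrete, here is a toy model in plain Python. It is not Spark code and it is not from the original answer. It only imitates two ways rows can be spread over partitions: hashing on a key (roughly what `repartition(n, col)` does) and dealing rows out in turn (roughly what `repartition(n)` does). The data, the key names, and the helpers `hash_partition`, `round_robin_partition`, and `skew_ratio` are all made up for illustration, and `zlib.crc32` stands in for whatever hash Spark actually uses. With one very common key, hashing on that key puts almost all rows in one partition, which is one plausible cause of "the last few tasks take forever".

```python
import zlib
from collections import Counter


def hash_partition(keys, num_partitions):
    """Count rows per partition when each row goes to hash(key) % num_partitions."""
    counts = Counter()
    for key in keys:
        # crc32 is a stand-in hash chosen because it is deterministic across runs.
        counts[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [counts.get(p, 0) for p in range(num_partitions)]


def round_robin_partition(keys, num_partitions):
    """Count rows per partition when rows are dealt out in turn, ignoring the key."""
    counts = [0] * num_partitions
    for i, _ in enumerate(keys):
        counts[i % num_partitions] += 1
    return counts


def skew_ratio(partition_sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean


# Hypothetical data: 9,000 rows share one hot key, 1,000 rows have distinct keys.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

by_key = hash_partition(keys, 8)
evenly = round_robin_partition(keys, 8)

print("hash by key :", by_key, "skew =", round(skew_ratio(by_key), 2))
print("round robin :", evenly, "skew =", round(skew_ratio(evenly), 2))
```

In a real job, the same comparison can usually be made by looking at per-task input or shuffle sizes for the slow stage in the Spark UI. Keep in mind that joins and aggregations on a key still need rows with the same key to end up together, so an even round-robin layout does not remove the shuffle for those operations; it only shows what "balanced" looks like.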

Hope this helps! Happy learning! 🙂

2 REPLIES

Anonymous
Not applicable

Hi @Greg Aponte

Hope all is well! Just wanted to check in: were you able to resolve your issue, and if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
