
What are the best resources for learning how to tune/optimize Spark?

gaponte
New Contributor III

I know this question/topic is not very specific, but perhaps asking it would be useful for people other than me.

I am a newbie to Spark, and while I've been able to get my current model training and data transformations running, they are taking an awfully long time. There are also symptoms that suggest I have not (yet) properly optimized Spark for what I'm doing: executors often sit idle, the last few tasks often take far longer than the first 99%, and there are other assorted issues.

Where is the best place to go to learn how to diagnose and fix Spark performance issues? I'm relatively confident that what I'm experiencing is not related to Databricks. Based on my preliminary research, it seems like Spark's performance can vary a LOT depending on whether it's been tuned properly for the use case at hand; I just don't know what the best/fastest approach is to becoming a Spark whisperer :-).

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Greg Aponte: There is no fast way to become a Spark expert; it takes a lot of dedication and hands-on work to get there. I would recommend studying the different join strategies, such as broadcast join, shuffle hash join, and sort-merge join. Essentially, the number of shuffles should be kept as low as possible, and to achieve that you should learn the concepts of filtering, repartition, and coalesce. This can come in handy as well. You can also find a lot of summit videos on YouTube by Bricksters where they explain how to optimize Spark code. Finally, learn what every Spark execution parameter means so that you can tweak it to best optimize your code.

Hope this helps! Happy learning! 🙂

2 REPLIES

Anonymous
Not applicable

Hi @Greg Aponte​ 

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
