
Spark running really slow - help required

msj50
New Contributor III

My company urgently needs help: we are having severe performance problems with Spark, and we will have to switch to a different solution if we don't get to the bottom of it.

We are on 1.3.1, using Spark SQL, partitioned ORC files, and in-memory caching, yet just a few users making 10 requests each really slows our cluster down, and we imminently need to be able to handle many more requests.

We have tried adding nodes, increasing cores/memory, changing stripe sizes, making config changes, etc. to speed up our queries, and we are getting nowhere. We urgently need any help people can offer. We are happy to pay; we just need to understand the limitations of Spark / Spark SQL better so we can decide what we need to do.
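
For reference, this is roughly how we load and cache the data (the table and column names here are illustrative, not our real schema):

```scala
// On Spark 1.3.1, ORC tables are read through HiveContext.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc) // sc is the existing SparkContext

// "events" stands in for one of our partitioned ORC tables in the metastore.
sqlContext.cacheTable("events") // lazy: materialized on the first query

// Subsequent user queries are served from the in-memory columnar cache.
sqlContext.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()
```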

11 REPLIES

Anonymous
Not applicable

@msj50 - can you share a URL to a notebook that we can look at to evaluate your performance issues?

Note: this is a private comment that only you can see! Please respond the same way.

msj50
New Contributor III

Hi Pat, thanks for getting back to me. We aren't using Databricks Cloud, but have our own cluster running on AWS. I created an earlier post with some additional information: https://forums.databricks.com/questions/919/what-is-the-optimal-number-of-cores-i-should-use-f.html

We are pretty stuck at the moment so I would be grateful for any help you can provide.

Anonymous
Not applicable

@msj50 - if you can get running in Databricks Cloud, we can certainly help. You can sign-up here: http://go.databricks.com/register-for-dbc.

Otherwise, you may want to post your questions to the users mailing list at: user@spark.apache.org

msj50
New Contributor III

Unfortunately Databricks Cloud doesn't match our requirements at this point in time... I was wondering if Databricks provides any consultancy outside of that, or if you could recommend someone else with a sufficient level of expertise in Spark performance?

msj50
New Contributor III

Anybody there?

vida
Contributor II

Hi,

The performance of your Spark queries is heavily determined by how your underlying data is encoded. If you have a ton of files, the run time of a Spark job can be dominated entirely by the time it takes to read all of them. Conversely, if you have very large files in an unsplittable format, that can also bottleneck your job. And for certain queries, data that is heavily skewed towards only a few keys can make a job very slow too.

In short, it's really hard to say exactly what is slowing down your jobs without doing some diagnosis on what you are doing specifically.
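
As a starting point for that diagnosis, two quick checks often narrow things down. This is only a sketch; the table, column, and path names are placeholders:

```scala
// Table, column, and path names below are placeholders.

// 1. Key skew: if a handful of keys dominate, any join or aggregation on that
//    key will be bottlenecked on a few huge partitions.
sqlContext.sql(
  "SELECT join_key, COUNT(*) AS n FROM my_table GROUP BY join_key ORDER BY n DESC LIMIT 20"
).show()

// 2. File count and size: many thousands of small files mean the job spends
//    its time opening files rather than scanning data.
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(new Path("/data/my_table")).filter(_.isFile)
val avgMb = if (files.isEmpty) 0L else files.map(_.getLen).sum / files.length / (1024L * 1024L)
println(s"${files.length} files, ~$avgMb MB average size")
```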

happpy
New Contributor II

@vida, can you please outline the steps required to properly diagnose what is actually slowing down data retrieval from the Spark cache?

Is there any official or unofficial help-and-support subscription available that I can buy to get some help?

If you have expertise in diagnosing and fixing slow data retrieval from the Spark cache, please feel free to get in contact with me.

vida
Contributor II

@msj50, @happpy​ 

I wish I had a neat checklist of things to check for performance, but there are too many potential issues that can cause slowness. These are the most common I've seen:

  • Too many files on too small a cluster - if you have more than a few thousand files and they are not huge, consolidating them should improve performance (see the sketch after this list).
  • How many columns do your ORC files have? If you have a ton of columns (hundreds or more) and are doing a SELECT * against your table, that can be slow even if you are only returning a small number of rows.
  • Are you joining your tables and causing a shuffle of your data? If so, it's expected that this will not be fast. In particular, if your output files are unevenly sized, your shuffle will be bottlenecked on the slowest partition.
  • Are you trying to use Spark in place of a database for production serving purposes? While Spark is meant to be fast, it's not meant to replace a production database. The best architecture for your system may be to use Spark to calculate your summary statistics, but then to write those statistics into a database for serving.
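
To make the first two bullets concrete, here is a rough sketch; the table names and columns are placeholders, and the target partition count is a number you must tune to your data volume:

```scala
// Consolidate many small ORC files into fewer, larger ones.
// 64 is an arbitrary placeholder; aim for output files of roughly 128-256 MB.
val df = sqlContext.sql("SELECT * FROM events") // "events" is illustrative
df.repartition(64).saveAsTable("events_compacted") // hypothetical new table

// Project only the columns you need instead of SELECT * on a wide table;
// a columnar format like ORC can then skip the unread columns entirely.
sqlContext.sql("SELECT user_id, amount FROM events_compacted WHERE dt = '2015-07-01'")
```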

As you can see, it's genuinely intricate to pin down which issue you may be facing.

Since you both asked about support - with a professional license of Databricks, we can diagnose and work through these issues with you and even advise on architecture level decisions for using Spark. Please email sales@databricks.com to inquire further.

Could you please state the workaround for each of the above bottlenecks? These (files of various sizes, tables with a high number of columns, joins, etc.) are very common use cases in data processing.

Marco
New Contributor II

In my project, the following solutions were rolled out one by one to improve performance (a rough sketch of the first two items follows the list):

  • To store intermediate results, use an in-memory cache (e.g., Ignite) instead of HDFS.
  • Only use Spark for complicated data aggregation; compute simple results locally on the driver.
  • Drop JDBC; use WebSocket communication instead of a web API between Spark and the browsers.
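
For the first two points, a rough illustration: Spark's built-in .cache() is shown as the simplest in-memory stand-in for an external cache like Ignite, and all table and column names are placeholders:

```scala
// 1. Keep an intermediate result in memory rather than writing it to HDFS.
//    (Marco used an external cache such as Ignite; .cache() is the simplest
//    built-in stand-in.)
val regionTotals = sqlContext.sql(
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region" // placeholder query
).cache()

// 2. Heavy aggregation stays in Spark; the small final step runs locally on
//    the driver instead of launching another distributed job.
//    (Assumes "amount" is an integral column, so SUM yields a Long.)
val top5 = regionTotals.collect().sortBy(-_.getLong(1)).take(5)
top5.foreach(println)
```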

Kaniz
Community Manager

Hi @msj50 , Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
