Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Running Really slow - help required

msj50
New Contributor III

My company urgently needs help: we are having severe performance problems with Spark and will have to switch to a different solution if we can't get to the bottom of it.

We are on Spark 1.3.1, using Spark SQL with partitioned ORC files and in-memory caching, yet just a few users making ten requests each really slows our cluster down, and we imminently need to be able to handle many more requests.

We have tried increasing nodes, cores/memory, stripe sizes, config changes, etc. to speed up our queries and are getting nowhere. We urgently need any help people can offer. We're happy to pay; we just need to understand the limitations of Spark / Spark SQL better so we can decide what we need to do.

10 REPLIES

Anonymous
Not applicable

@msj50 - can you share a URL to a notebook that we can look at to evaluate your performance issues?

Note this is a private comment that only you can see! Please respond with the same.

msj50
New Contributor III

Hi Pat, thanks for getting back to me. We aren't using Databricks Cloud but have our own cluster running on AWS. I created an earlier post with some additional information: https://forums.databricks.com/questions/919/what-is-the-optimal-number-of-cores-i-should-use-f.html

We are pretty stuck at the moment so I would be grateful for any help you can provide.

Anonymous
Not applicable

@msj50 - if you can get up and running in Databricks Cloud, we can certainly help. You can sign up here: http://go.databricks.com/register-for-dbc.

Otherwise, you may want to post your questions to the users mailing list at: user@spark.apache.org

msj50
New Contributor III

Unfortunately Databricks Cloud doesn't match our requirements at this point in time... I was wondering if Databricks provides any consultancy outside of that, or if you could recommend someone else with a sufficient level of expertise in Spark performance?

msj50
New Contributor III

anybody there?

vida
Databricks Employee

Hi,

The performance of your Spark queries is heavily affected by how your underlying data is encoded. If you have a huge number of files, the run time of a Spark job can be dominated entirely by the time it takes to read them all. Conversely, very large files in an unsplittable format can also bottleneck a job. And if your data is heavily skewed towards only a few keys, certain queries can be very slow too.

But in short - it's really hard to say exactly what is slowing down your jobs without doing some diagnosis on what you are doing specifically.
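As a cheap first diagnostic (outside Spark itself), you can check whether the "too many files" or "unsplittable huge file" patterns apply by scanning your data directory. This is a generic sketch in plain Python, not Databricks tooling; the path in the example is a placeholder.

```python
import os

def file_size_stats(root):
    """Walk a data directory and summarize file count and sizes.

    A very large count of small files suggests job time is dominated
    by per-file read overhead; a few huge files in an unsplittable
    format suggest a parallelism bottleneck.
    """
    sizes = [
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    if not sizes:
        return {"count": 0, "total_mb": 0.0, "avg_mb": 0.0, "max_mb": 0.0}
    mb = 1024 * 1024
    return {
        "count": len(sizes),
        "total_mb": sum(sizes) / mb,
        "avg_mb": sum(sizes) / len(sizes) / mb,
        "max_mb": max(sizes) / mb,
    }

# Example (placeholder path):
# stats = file_size_stats("/data/warehouse/my_orc_table")
```

Thousands of files with a tiny `avg_mb` points at consolidation; a handful of files with a huge `max_mb` points at splittability.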

happpy
New Contributor II

@vida can you please guide us through the steps required to properly diagnose what is actually slowing down Spark cached-data retrieval?

Is there any official or unofficial help-and-support subscription available which I can buy to get some help?

If you have expertise in diagnosing and fixing slow Spark cache data retrieval, please feel free to get in contact with me.

vida
Databricks Employee

@msj50, @happpy​ 

I wish I had a neat checklist of things to check for performance, but there are too many potential issues that can cause slowness. These are the most common I've seen:

  • Too many files on too small a cluster - if you have more than a few thousand files and your files are not huge, consolidating them should improve performance.
  • How many columns do your ORC files have? If you have a ton of columns (hundreds or more) and are doing a select * against your table, that can be slow even if you are only returning a small number of rows.
  • Are you joining your tables and causing a shuffle of your data? If so, it's expected this will not be fast. In particular, if your partitions are unevenly sized, the shuffle will be bottlenecked on the slowest partition.
  • Are you trying to use Spark in place of a database for production serving purposes? While Spark is meant to be fast, it's not meant to replace the need for a production database. The best architecture for your system may be to use Spark to calculate your summary statistics, but then to write those statistics into a database for serving purposes.

As you can see, it's hard to pin down which of these issues you may be facing.

Since you both asked about support - with a professional license of Databricks, we can diagnose and work through these issues with you and even advise on architecture level decisions for using Spark. Please email sales@databricks.com to inquire further.

Could you please state the workaround for each of the above bottlenecks? These (files of various sizes, tables with a high number of columns, joins, etc.) are very common use cases in data processing.

Marco
New Contributor II

In my project, the following solutions were rolled out one by one to improve performance:

  • Store intermediate results in a memory cache (e.g. Ignite) instead of HDFS.
  • Only use Spark for complicated data aggregation; for simple results, compute them locally on the driver.
  • Drop JDBC and use WebSocket communication instead of a web API (Spark <-------> browsers).
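The first point above - keeping intermediate results in memory rather than re-reading them - can be sketched as a compute-once, serve-many cache. This is a minimal plain-Python illustration of the pattern, not Ignite's API; the compute function is a placeholder standing in for an expensive Spark query.

```python
class ResultCache:
    """Compute-once, serve-many cache for intermediate results.

    Repeated user requests for the same aggregation are served from
    memory instead of triggering another scan of the underlying files.
    """

    def __init__(self, compute):
        self._compute = compute  # expensive function, e.g. a Spark query
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(key)
        return self._store[key]

# Placeholder "expensive" computation:
cache = ResultCache(lambda k: k.upper())
cache.get("report_a")  # computed on first request
cache.get("report_a")  # served from memory on repeat
```

Ten users issuing the same ten queries then cost one computation each, not a hundred, which matches the original complaint about a few users slowing the whole cluster.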
