Databricks Community

msj50 · ‎05-29-2015

My company urgently needs help. We are having severe performance problems with Spark and will have to switch to a different solution if we don't get to the bottom of it.

We are on 1.3.1, using Spark SQL and ORC files with partitions and in-memory caching, yet just a few users making 10 requests each really slows our cluster down, and we need to be able to handle many more requests imminently.

We have tried increasing nodes, cores/memory, stripe sizes, config changes etc. to speed up our queries and are getting nowhere. We urgently need any help people can offer. Happy to pay; we just need to better understand the limitations of Spark / Spark SQL so we can decide what we need to do.

Anonymous · ‎05-29-2015

@msj50 - can you share a URL to a notebook that we can look at to evaluate your performance issues?

Note this is a private comment that only you can see! Please respond with the same.

msj50 · ‎05-29-2015

Hi Pat, thanks for getting back to me. We aren't using Databricks Cloud; we have our own cluster running on AWS. I created an earlier post with some additional information: https://forums.databricks.com/questions/919/what-is-the-optimal-number-of-cores-i-should-use-f.html

We are pretty stuck at the moment so I would be grateful for any help you can provide.

Anonymous · ‎05-29-2015

@msj50 - if you can get up and running in Databricks Cloud, we can certainly help. You can sign up here: http://go.databricks.com/register-for-dbc.

Otherwise, you may want to post your questions to the users mailing list at: user@spark.apache.org

msj50 · ‎05-29-2015

Unfortunately Databricks Cloud doesn't match our requirements at this point in time... I was wondering if Databricks provides any consultancy outside of that, or if you could recommend someone else with a sufficient level of expertise in Spark performance?

msj50 · ‎06-01-2015

Anybody there?

vida · ‎06-02-2015

Hi,

The performance of your Spark queries is severely impacted by the way your underlying data is encoded. If you have a ton of files, the run time of your Spark job can sometimes depend entirely on the time it takes to read all of them. Other times, if you have very large files in an unsplittable format, that can also bottleneck your job. And if you run certain queries and your data is heavily skewed towards only a few keys, that can make your job very slow too.

But in short - it's really hard to say exactly what is slowing down your jobs and what is going on without doing some diagnosis on what you are doing specifically.
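As a starting point for that kind of diagnosis, here is a minimal sketch in plain Python (not Spark API) that checks two of the symptoms above: lots of small files, and rows concentrated on a few keys. The function names and the 32 MB "small file" threshold are assumptions made for illustration; you would feed it a file-size listing from your storage and a sample of the key column you join or group on.

```python
from collections import Counter
from statistics import median


def summarize_files(sizes_bytes, small_file_bytes=32 * 1024 * 1024):
    """Summarize a listing of input file sizes (in bytes).

    small_file_bytes is an assumed threshold, not a Spark setting.
    """
    if not sizes_bytes:
        return {"count": 0, "total_bytes": 0, "median_bytes": 0, "small_fraction": 0.0}
    small = sum(1 for s in sizes_bytes if s < small_file_bytes)
    return {
        "count": len(sizes_bytes),
        "total_bytes": sum(sizes_bytes),
        "median_bytes": median(sizes_bytes),
        "small_fraction": small / len(sizes_bytes),
    }


def key_skew(keys, top_n=3):
    """Return the fraction of rows held by the top_n most frequent keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(c for _, c in counts.most_common(top_n))
    return top / total
```

A file count in the thousands with a high `small_fraction`, or a `key_skew` close to 1.0 for a small `top_n`, would each point at one of the causes described above, though neither number proves it on its own.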

happpy · ‎07-06-2015

@vida can you please explain the steps required to do proper diagnostics and identify what is actually slowing down Spark cache data retrieval?

Is there any official or unofficial help and support subscription available which I can buy to get some help?

If you have expertise in diagnosing and fixing slow Spark cache data retrieval, please feel free to get in contact with me.

vida · ‎07-15-2015

@msj50, @happpy

I wish I had a neat checklist of things to check for performance, but there are too many potential issues that can cause slowness. These are the most common I've seen:

Too many files on too small a cluster - if you have more than a few thousand files and your files are not huge, consolidating them should be a performance improvement.
How many columns do your ORC files have? If you have a ton of columns (hundreds or more) and are doing a select * against your table, that can be slow even if you are only returning a small number of rows.
Are you joining your tables and causing a shuffle of your data? If so, it's expected that this will not be fast. In particular, if your output files are unevenly sized, your shuffle will be bottlenecked on the slowest partition.
Are you trying to use Spark in place of a database for production serving purposes? While Spark is meant to be fast, it's not meant to replace the need for a production database. The best architecture for your system may be to use Spark to calculate your summary statistics, but then to write these statistics into a database for serving purposes.
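To make the last point concrete, here is a minimal sketch of the "compute in Spark, serve from a database" split. It uses Python's built-in `sqlite3` purely as a stand-in for whatever production database you choose; the table layout, the function names, and the sample rows are all hypothetical, and the Spark job that would produce the summaries is not shown.

```python
import sqlite3


def write_summaries(conn, rows):
    """Store precomputed (key, total, row_count) summaries for serving."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summary ("
        "key TEXT PRIMARY KEY, total REAL, row_count INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO summary (key, total, row_count) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def lookup(conn, key):
    """Serve one request with an indexed point lookup instead of a Spark job."""
    return conn.execute(
        "SELECT total, row_count FROM summary WHERE key = ?", (key,)
    ).fetchone()


# Hypothetical output of a batch Spark job (made-up values for illustration),
# e.g. a small aggregated result collected to the driver.
summaries = [("2015-05-28", 1250.0, 40), ("2015-05-29", 980.5, 31)]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real serving database
write_summaries(conn, summaries)
```

With this split, each user request becomes something like `lookup(conn, "2015-05-29")`, a primary-key read that does not touch the cluster, and Spark only runs when the summaries need to be refreshed.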

As you can see - working out which issue you are actually facing is really intricate.

Since you both asked about support - with a professional license of Databricks, we can diagnose and work through these issues with you and even advise on architecture-level decisions for using Spark. Please email sales@databricks.com to inquire further.

pradeepyadagani · ‎09-15-2019

Could you please state the workaround for each of the bottlenecks above? These (files of various sizes, tables with a high number of columns, joins etc.) are very common use cases in data processing.

Marco · ‎11-02-2015

In my project, the following solutions were introduced one by one to improve performance:

To store intermediate results, use an in-memory cache (such as Ignite Cache) instead of HDFS.
Only use Spark for complicated data aggregation; for simple results, just compute them locally on the driver.
Drop JDBC, and use WebSocket communication instead of a web API (Spark <-------> browsers).


Spark Running Really slow - help required
