topic Re: My Spark SQL join is very slow - what can I do to speed it up? in Data Engineering

My Spark SQL join is very slow - what can I do to speed it up?

vida — Thu, 18 Jun 2015 01:00:27 GMT

It's taking 10-12 minutes - can I make it faster?

Re: My Spark SQL join is very slow - what can I do to speed it up?

vida — Thu, 18 Jun 2015 01:17:54 GMT

If one of your tables is very small, you can do a Broadcast Hash Join to speed up your join. There's notebook on the Databricks Guide on that - search for "BroadcastHashJoin" to find that notebook.

Otherwise, a join operation in Spark SQL does cause a shuffle of your data to have the data transferred over the network, which can be slow. If you query the joined table multiple times, you may consider:

1) Saving your joined table as it's own table.

2) Creating a temporary table that represents the joined table, and then caching that table.

This will prevent needing to shuffle the data and doing the join multiple times.

Re: My Spark SQL join is very slow - what can I do to speed it up?

richard1_558848 — Sun, 21 Jun 2015 07:25:04 GMT

I'm making join between Parquet DB stored on S3

but it's seems that anyway Spark try to read all the data as we not see better performance when changing the queries.

I need to continue to investigate this point because it's not yet clear.

Re: My Spark SQL join is very slow - what can I do to speed it up?

__rake — Thu, 25 Jun 2015 18:52:52 GMT

I think I'm experiencing something similar.

Not using S3 yet. But reading Parquet tables into DataFrames, trying tactics like

persist

coalesce

repartition

after reading from Parquet. Using HiveContext, if that matters. But I get the impression that it's ignoring my attempts to repartition and cache and always recomputing my queries from scratch.

I'm definitely still new at this, so not sure yet how to figure out what's really going on.

Re: My Spark SQL join is very slow - what can I do to speed it up?

MarcLimotte — Wed, 22 Jul 2015 16:40:48 GMT

@Vida Ha I just did "Run All" on a clone of the "BroadcastHashJoin" notebook, and it appears to have errors.

%sql ANALYZE TABLE my_small_table COMPUTE STATISTICS noscan

which appears right after "Configure a BroadcastHashJoin for a small table.", we get org.apache.spark.sql.catalyst.analysis.NoSuchTableException. I'm guessing this is b/c at this point in the notebook (on it's first run), my_small_table has only been created with registerTempTable; not with saveAsTable (which I think is required for ANALYZE).

If I run the code blocks lower down to do the saveAsTable and then come back here and rerun the "ANALYZE" step, I get:

Error in SQL statement: com.databricks.backend.daemon.driver.DriverLocal$SQLExecutionException: java.lang.UnsupportedOperationException: Analyze only works for Hive tables, but my_small_table is a LogicalRelation

Re: My Spark SQL join is very slow - what can I do to speed it up?

vida — Wed, 22 Jul 2015 17:57:52 GMT

Hi mlimotte,

This notebook ran for an older version of Spark - we will update it for Spark 1.4 in the next Databricks release. In general these tips still hold true:

1) If you want to figure out if a BroadcastHashJoin is happening or not - use %sql explain select

2) The analyze command is not needed for all table types now - such as when you create a table with ".saveAsTable"

-Vida

Re: My Spark SQL join is very slow - what can I do to speed it up?

DeenarToraskar — Wed, 18 Nov 2015 07:04:28 GMT

Hi Vida

>>The analyze command is not needed for all table types now - such as when you create a table with ".saveAsTable"

Does this mean statistics are available for any tables created with .saveAsTable (and hence they would be broadcast if their size is under the broadcast limit ?)

Re: My Spark SQL join is very slow - what can I do to speed it up?

DeenarToraskar — Wed, 18 Nov 2015 07:06:44 GMT

actually I got my answer from the BroadcastHashJoin workbook. (which is Yes)

Re: My Spark SQL join is very slow - what can I do to speed it up?

vida — Wed, 18 Nov 2015 18:00:36 GMT

Analyze is not needed with parquet tables that use the databricks parquet package. That is the default now when you use .saveAsTable(), but if you use a different output format - it's possible that analyze may not work yet.