06-17-2015 06:00 PM
06-17-2015 06:17 PM
If one of your tables is very small, you can do a Broadcast Hash Join to speed up your join. There's a notebook in the Databricks Guide on that - search for "BroadcastHashJoin" to find it.
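As a minimal sketch of how the broadcast side works, assuming Spark 1.4-era APIs (the table names big_table and my_small_table below are hypothetical): tables with known statistics smaller than spark.sql.autoBroadcastJoinThreshold (in bytes) are broadcast automatically.

// Raise the broadcast threshold to 10 MB (all names here are hypothetical)
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")
val joined = sqlContext.sql(
  "SELECT * FROM big_table b JOIN my_small_table s ON b.key = s.key")
joined.explain() // look for BroadcastHashJoin in the physical plan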
Otherwise, a join operation in Spark SQL causes a shuffle, which transfers your data over the network and can be slow. If you query the joined table multiple times, you may consider:
1) Saving your joined table as its own table.
2) Creating a temporary table that represents the joined table, and then caching that table (see the sketch below).
This avoids shuffling the data and redoing the join on every query.
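A rough sketch of both options, assuming Spark 1.4-era APIs; all table and column names here are hypothetical:

val joined = sqlContext.sql(
  "SELECT a.*, b.extra_col FROM table_a a JOIN table_b b ON a.id = b.id")

// Option 1: save the join result as its own table
joined.write.saveAsTable("joined_saved")

// Option 2: register a temp table and cache it
joined.registerTempTable("joined_cached")
sqlContext.cacheTable("joined_cached") // caching is lazy...
sqlContext.sql("SELECT COUNT(*) FROM joined_cached").show() // ...an action materializes it

Either way, later queries read the precomputed join instead of reshuffling both inputs.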
06-21-2015 12:25 AM
I'm doing a join between Parquet tables stored on S3,
but it seems that Spark reads all of the data anyway, since we don't see better performance when changing the queries.
I need to continue investigating this point because it's not yet clear.
06-25-2015 11:52 AM
I think I'm experiencing something similar.
Not using S3 yet, but reading Parquet tables into DataFrames and trying tactics like persist, coalesce, and repartition after reading from Parquet. Using HiveContext, if that matters. But I get the impression that it's ignoring my attempts to repartition and cache, and always recomputing my queries from scratch.
I'm definitely still new at this, so not sure yet how to figure out what's really going on.
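One common pitfall that produces exactly this impression (a sketch, not a diagnosis of this specific case; the path and column name below are hypothetical): DataFrame transformations like repartition return a new DataFrame rather than modifying the one you call them on, and persist is lazy, so nothing is cached until an action runs.

import org.apache.spark.storage.StorageLevel

val df = sqlContext.read.parquet("/path/to/table") // hypothetical path
// repartition returns a NEW DataFrame; df itself is untouched
val tuned = df.repartition(64).persist(StorageLevel.MEMORY_AND_DISK)
tuned.count() // persist is lazy; an action forces the cache to fill
tuned.groupBy("some_col").count().show() // should now hit the cache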
07-22-2015 09:40 AM
@Vida Ha I just did "Run All" on a clone of the "BroadcastHashJoin" notebook, and it appears to have errors.
On
%sql ANALYZE TABLE my_small_table COMPUTE STATISTICS noscan
which appears right after "Configure a BroadcastHashJoin for a small table.", we get org.apache.spark.sql.catalyst.analysis.NoSuchTableException. I'm guessing this is because at this point in the notebook (on its first run), my_small_table has only been created with registerTempTable, not with saveAsTable (which I think is required for ANALYZE).
If I run the code blocks lower down to do the saveAsTable and then come back here and rerun the "ANALYZE" step, I get:
Error in SQL statement: com.databricks.backend.daemon.driver.DriverLocal$SQLExecutionException: java.lang.UnsupportedOperationException: Analyze only works for Hive tables, but my_small_table is a LogicalRelation
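For what it's worth, a sketch of why both attempts fail (df and the table name are hypothetical): a temp table lives only in the session catalog, so Hive's ANALYZE can't see it, while a table written with saveAsTable is, in this Spark version, a data-source LogicalRelation, which the Hive-only ANALYZE command also rejects.

df.registerTempTable("my_small_table")  // ANALYZE -> NoSuchTableException
df.write.saveAsTable("my_small_table")  // ANALYZE -> UnsupportedOperationException (LogicalRelation)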
07-22-2015 10:57 AM
Hi mlimotte,
This notebook was written for an older version of Spark - we will update it for Spark 1.4 in the next Databricks release. In general, these tips still hold true:
1) If you want to figure out whether a BroadcastHashJoin is happening, use %sql explain select (see the sketch after these tips).
2) The analyze command is no longer needed for all table types - for example, when you create a table with ".saveAsTable".
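A minimal sketch of tip 1 (the table and column names are hypothetical); this is the programmatic equivalent of running %sql explain select ... in a notebook cell:

sqlContext.sql(
  "EXPLAIN SELECT * FROM big_table b JOIN my_small_table s ON b.key = s.key"
).collect().foreach(println)
// a BroadcastHashJoin operator in the output means the small table will be broadcast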
-Vida
11-17-2015 11:04 PM
Hi Vida
>>The analyze command is not needed for all table types now - such as when you create a table with ".saveAsTable"
Does this mean statistics are available for any table created with .saveAsTable (and hence it would be broadcast if its size is under the broadcast limit)?
11-17-2015 11:06 PM
Actually, I got my answer from the BroadcastHashJoin notebook (which is: yes).
11-18-2015 10:00 AM
Analyze is not needed with Parquet tables that use the Databricks Parquet package. That is the default now when you use .saveAsTable(), but if you use a different output format, it's possible that analyze may not work yet.
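A short sketch of the distinction, assuming Spark 1.4-era APIs (smallDf and the table names are hypothetical):

// The default format is the built-in parquet data source, which tracks its
// own size statistics, so no ANALYZE is needed before a broadcast join:
smallDf.write.saveAsTable("my_small_table")
// With a different output format, statistics may be unavailable, so the
// automatic broadcast decision can't be made:
smallDf.write.format("json").saveAsTable("my_small_table_json")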