
broadcasted table reuse

tajinder123
New Contributor II

In Spark, table1 is small and is broadcasted and joined with table2, and the output is stored in df1. Later, table1 needs to be joined with table3 and the output stored in df2. Does table1 need to be broadcasted again?
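
Roughly, the setup is something like this (the join key id and the explicit broadcast hint in the first join are just for illustration, not details from the actual job):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    table1 = spark.table("table1")   # small table
    table2 = spark.table("table2")   # large table
    table3 = spark.table("table3")   # another large table

    # First join: the small table is broadcast and joined with table2
    df1 = table2.join(broadcast(table1), on="id", how="inner")   # "id" is an assumed join key

    # Second join: does table1 need to be broadcast again here?
    df2 = table3.join(broadcast(table1), on="id", how="inner")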

1 REPLY

Kaniz_Fatma
Community Manager

Hi @tajinder123, when you perform a join operation in Spark, it's essential to consider the size of the DataFrames involved. A broadcast join is an optimization technique that comes in handy when joining a large DataFrame with a smaller one. Here's how it works:

  1. Initial Broadcast Join:

    • You mentioned that table1 (a smaller DataFrame) was broadcasted and joined with table2, resulting in df1.
    • In this step, Spark broadcasted the smaller DataFrame (table1) to all executors. Each executor kept this DataFrame in memory.
    • The larger DataFrame (table2) was split and distributed across all executors.
    • The join was performed without shuffling any data from the larger DataFrame because the necessary data for the join was colocated on every executor.
  2. Subsequent Join with table3:

    • Now you need to join table1 (again) with table3 and store the output in df2.
    • The question is whether you should broadcast table1 again.
    • The answer depends on the size of table1 and the memory available on the Spark driver and executors.
  3. Considerations:

    • If table1 fits comfortably in memory (on both the Spark driver and the executors), you can simply broadcast it again for the join with table3, just as you did for table2; for a table that small, the extra broadcast is cheap (see the sketch after this list).
    • However, if table1 is too large to fit in memory, broadcasting it again would risk out-of-memory errors.
    • Remember that the broadcast join threshold can be configured using the spark.sql.autoBroadcastJoinThreshold property. You can adjust this threshold based on your memory availability.
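
As a rough sketch of points 2 and 3 (the join key id is an assumption on my part, and the table names are reused from your question), you can hint the broadcast explicitly for the second join and check the physical plan to confirm the strategy:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()
    table1 = spark.table("table1")   # the small table
    table3 = spark.table("table3")   # the larger table

    # Hint the broadcast explicitly; Spark would also choose a broadcast join
    # automatically if table1 is below the configured size threshold.
    df2 = table3.join(broadcast(table1), on="id", how="inner")   # "id" is an assumed join key

    # A BroadcastHashJoin node in the physical plan confirms that the small table
    # is shipped to every executor instead of shuffling table3.
    df2.explain()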

In summary, if memory constraints allow, broadcast table1 again for the subsequent join with table3. Otherwise, consider other optimization techniques or adjust the broadcast threshold accordingly.
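
For reference, here is a minimal sketch of checking and adjusting that threshold from a notebook (the 50 MB figure is just an example value, not a recommendation):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # already defined in Databricks notebooks

    # Current threshold in bytes (Spark's default is 10 MB)
    print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    # Raise the threshold if the driver and executors have memory headroom
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

    # Or set -1 to disable automatic broadcast joins and rely on explicit broadcast() hints
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)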
