
broadcasted table reuse

tajinder123
New Contributor II

In Spark, table1 is small, so it is broadcast and joined with table2, and the output is stored in df1. Later, table1 needs to be joined with table3 and the output stored in df2. Does table1 need to be broadcast again?

1 REPLY

Kaniz
Community Manager

Hi @tajinder123, when you perform a join in Spark, it's essential to consider the size of the DataFrames involved. A broadcast join is an optimization that helps when joining a large DataFrame with a much smaller one. Here's how it works:

  1. Initial Broadcast Join:

    • You mentioned that table1 (a smaller DataFrame) was broadcasted and joined with table2, resulting in df1.
    • In that step, Spark shipped the smaller DataFrame (table1) to every executor, and each executor kept a full copy in memory.
    • The larger DataFrame (table2) stayed partitioned across the executors.
    • The join was performed without shuffling the larger DataFrame, because every executor already held the small side of the join locally.
  2. Subsequent Join with table3:

    • Now you need to join table1 (again) with table3 and store the output in df2.
    • The question is whether table1 should be broadcast again for this second join.
    • The answer depends on the size of table1 and the memory available on the driver and the executors.
  3. Considerations:

    • If table1 fits comfortably in memory (on both the driver and the executors), broadcasting it again for the join with table3 is fine: broadcasting a small table is cheap, and when both joins end up in the same query plan Spark can reuse the broadcast exchange rather than shipping the table twice.
    • However, if table1 is too large to fit in memory, broadcasting it risks out-of-memory errors on the driver or the executors.
    • Remember that the automatic broadcast threshold is controlled by the spark.sql.autoBroadcastJoinThreshold property (10 MB by default). You can adjust it based on your memory availability, or set it to -1 to disable automatic broadcasting; see the sketch just after this list.
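
For concreteness, here is a minimal PySpark sketch of the two joins. The table names, the id join key, and the 50 MB threshold are placeholder assumptions for illustration, not details from your question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-reuse").getOrCreate()

# Optionally raise the automatic broadcast threshold (value in bytes; default is 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

table1 = spark.table("table1")  # small table (assumed name)
table2 = spark.table("table2")  # large table (assumed name)
table3 = spark.table("table3")  # large table (assumed name)

# First join: hint Spark to broadcast the small table.
df1 = table2.join(broadcast(table1), on="id", how="inner")

# Second join: apply the same hint again. Broadcasting a small table is cheap,
# and Spark can reuse an identical broadcast exchange when the plans match.
df2 = table3.join(broadcast(table1), on="id", how="inner")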

In summary, if memory constraints allow, broadcast table1 again (or let Spark reuse the broadcast) for the subsequent join with table3. Otherwise, consider other optimization techniques or adjust the broadcast threshold accordingly.
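
If you want to confirm which strategy the optimizer actually chose, explain() prints the physical plan; a broadcast join appears there as BroadcastHashJoin. The -1 setting below is shown only as an example of adjusting the threshold, and reuses the placeholder names from the sketch above:

# Inspect the physical plans; look for "BroadcastHashJoin" (and possibly "ReusedExchange").
df1.explain()
df2.explain()

# If table1 turns out to be too large to broadcast safely, disable automatic
# broadcasting and let Spark fall back to a shuffle-based join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)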

 