
broadcasted table reuse

tajinder123
New Contributor II

In Spark, table1 is small, so it is broadcast and joined with table2, and the output is stored in df1. Later, table1 needs to be joined with table3 and the output stored in df2. Does table1 need to be broadcast again?

1 REPLY

Kaniz
Community Manager

Hi @tajinder123, when you perform a join in Spark, it's essential to consider the size of the DataFrames involved. A broadcast join is an optimization that helps when joining a large DataFrame with a much smaller one. Here's how it works:

  1. Initial Broadcast Join:

    • You mentioned that table1 (a smaller DataFrame) was broadcasted and joined with table2, resulting in df1.
    • In that step, Spark shipped the smaller DataFrame (table1) to every executor, and each executor kept a full copy in memory.
    • The larger DataFrame (table2) stayed partitioned across the executors.
    • The join was performed without shuffling the larger DataFrame, because every executor already held the small side of the join locally.
  2. Subsequent Join with table3:

    • Now you need to join table1 (again) with table3 and store the output in df2.
    • The question is whether table1 should be broadcast again for this second join.
    • The answer depends on the size of table1 and the memory available on the driver and the executors.
  3. Considerations:

    • If table1 fits comfortably in memory (on both the driver and the executors), broadcasting it again for the join with table3 is fine: broadcasting a small table is cheap, and when both joins end up in the same query plan Spark can reuse the broadcast exchange rather than shipping the table twice.
    • However, if table1 is too large to fit in memory, broadcasting it risks out-of-memory errors on the driver or the executors.
    • Remember that the automatic broadcast threshold is controlled by the spark.sql.autoBroadcastJoinThreshold property (10 MB by default). You can adjust it based on your memory availability, or set it to -1 to disable automatic broadcasting; see the sketch just after this list.
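
For concreteness, here is a minimal PySpark sketch of the two joins. The table names, the id join key, and the 50 MB threshold are placeholder assumptions for illustration, not details from your question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-reuse").getOrCreate()

# Optionally raise the automatic broadcast threshold (value in bytes; default is 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

table1 = spark.table("table1")  # small table (assumed name)
table2 = spark.table("table2")  # large table (assumed name)
table3 = spark.table("table3")  # large table (assumed name)

# First join: hint Spark to broadcast the small table.
df1 = table2.join(broadcast(table1), on="id", how="inner")

# Second join: apply the same hint again. Broadcasting a small table is cheap,
# and Spark can reuse an identical broadcast exchange when the plans match.
df2 = table3.join(broadcast(table1), on="id", how="inner")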

In summary, if memory constraints allow, broadcast table1 again (or let Spark reuse the broadcast) for the subsequent join with table3. Otherwise, consider other optimization techniques or adjust the broadcast threshold accordingly.
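
If you want to confirm which strategy the optimizer actually chose, explain() prints the physical plan; a broadcast join appears there as BroadcastHashJoin. The -1 setting below is shown only as an example of adjusting the threshold, and reuses the placeholder names from the sketch above:

# Inspect the physical plans; look for "BroadcastHashJoin" (and possibly "ReusedExchange").
df1.explain()
df2.explain()

# If table1 turns out to be too large to broadcast safely, disable automatic
# broadcasting and let Spark fall back to a shuffle-based join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)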

 