Greetings @pooja_bhumandla, here are some hints that may help.
Diagnosis
Your error indicates that a broadcast join operation is attempting to send ~64MB of data to executors, but the BlockManager cannot store it due to memory constraints. This commonly occurs in Structured Streaming with `foreachBatch` when Spark automatically decides to broadcast a DataFrame that exceeds available executor memory.
How Spark Decides on Broadcast Joins
Spark automatically performs a broadcast join when one side of the join meets these criteria:
- The estimated table size is below `spark.sql.autoBroadcastJoinThreshold` (default: **10MB**)
- The join type supports broadcasting that side (INNER and CROSS can broadcast either side; LEFT OUTER, LEFT SEMI, and LEFT ANTI can broadcast only the right side; RIGHT OUTER only the left side)
- Spark's planner determines broadcasting is the cheapest strategy (the decision is driven by the size estimate against the threshold; cost-based optimizer statistics, when available, only refine that estimate)
The problem is that Spark's size estimator can underestimate the actual DataFrame size, especially after transformations such as filters and joins, so Spark attempts broadcasts that exceed the memory actually available.
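The decision above can be sketched as a simple threshold check. This is an illustrative model only, not Spark's actual planner code; the default value mirrors `spark.sql.autoBroadcastJoinThreshold` (10MB), and a negative threshold disables broadcasting entirely:

```python
# Illustrative sketch of the broadcast decision, NOT Spark's real planner.
# The threshold mirrors spark.sql.autoBroadcastJoinThreshold (default 10MB);
# setting it to -1 disables automatic broadcasting.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10MB

def choose_join_strategy(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Pick a join strategy the way the planner conceptually does."""
    if threshold_bytes >= 0 and estimated_size_bytes <= threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

# A 5MB table falls under the threshold; a 64MB table does not.
print(choose_join_strategy(5 * 1024 * 1024))    # BroadcastHashJoin
print(choose_join_strategy(64 * 1024 * 1024))   # SortMergeJoin
# With the threshold set to -1, nothing is broadcast automatically.
print(choose_join_strategy(5 * 1024 * 1024, threshold_bytes=-1))  # SortMergeJoin
```

The key point for your error: if the estimator reports 5MB for a table that actually materializes at 64MB, the broadcast still gets planned, and the failure only surfaces when the BlockManager tries to store the block.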
Recommended Solutions
1. Disable Auto-Broadcast for Streaming Jobs
Set the threshold to -1 to prevent automatic broadcasting:
```python
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```
2. Use Join Hints to Force SortMergeJoin
Within your `foreachBatch` function, explicitly prevent broadcasting:
```python
def process_batch(batch_df, batch_id):
    # The "merge" hint forces a SortMergeJoin for this join,
    # overriding any automatic broadcast decision
    result = batch_df.join(other_df.hint("merge"), "key")
    result.write.format("delta").mode("append").save(path)
```
3. Collect Table Statistics
Before joins, gather accurate statistics to improve Spark's size estimation:
```sql
ANALYZE TABLE your_table COMPUTE STATISTICS
```
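If the join keys themselves matter for estimation, you can extend the statement to column-level statistics and then confirm they were recorded (syntax below assumes a table registered in the metastore; `your_table` is a placeholder):

```sql
ANALYZE TABLE your_table COMPUTE STATISTICS FOR ALL COLUMNS;
-- Check the Statistics row (sizeInBytes / rowCount) in the output
DESCRIBE EXTENDED your_table;
```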
4. Increase Executor Memory
If broadcasts are necessary, ensure sufficient memory. Note that `spark.executor.memory` and `spark.executor.memoryOverhead` are static configurations read at JVM launch; they cannot be changed at runtime with `spark.conf.set`. Set them in the cluster's Spark configuration (or via `spark-submit`) before the application starts:

```
spark.executor.memory 8g
spark.executor.memoryOverhead 2g
```
5. Cache Tables Strategically
If you repeatedly join with the same static table in `foreachBatch`, cache it once outside the streaming query. This avoids re-reading it on every micro-batch and gives Spark an accurate in-memory size for the join decision.
Best Practice for Streaming
For Structured Streaming jobs, disabling auto-broadcast (`-1`) is generally recommended since streaming micro-batches can have unpredictable sizes, and SortMergeJoin handles dynamic data volumes more reliably.
Hope this helps, Louis.