Hey @adhi_databricks , I did some digging and have come up with some helpful tips.
The significant increase in file size from 60GB to 200GB after implementing broadcast join, despite having identical data, is most likely caused by poor compression efficiency due to increased output partitioning and small ORC stripes.
Root Cause
When you switched to broadcast join, the output DataFrame likely ended up with 200 partitions (the default value of `spark.sql.shuffle.partitions`). While broadcast join eliminates shuffling during the join operation itself, the resulting DataFrame still inherits the partitioning configuration, which affects how data is written to ORC files.
With sort-merge join, your data was likely consolidated into fewer, larger partitions due to the shuffle operation. After switching to broadcast join, the data became distributed across many more partitions (potentially 200), causing Spark to write many small ORC files or create files with numerous small stripes.
Why This Causes Size Bloat
ORC compression efficiency heavily depends on stripe size. The default ORC stripe size is 67MB, and Snappy compression works best with larger, continuous data blocks. When data is fragmented across 200+ partitions:
- Each partition writes small amounts of data
- Small stripes (under 2-8MB) compress poorly compared to larger ones
- Compression overhead and metadata increase proportionally
- Dictionary encoding and other ORC optimizations become less effective
Solutions
Coalesce before writing:
```python
df.coalesce(10).write.format("orc").save(output_path)
```
Adjust shuffle partitions:
```python
spark.conf.set("spark.sql.shuffle.partitions", "10")
```
Configure ORC stripe size:
```python
spark.conf.set("spark.sql.orc.stripe.size", "134217728") # 128MB
```
The key is to consolidate your data into fewer, larger partitions before writing to ORC format. This allows ORC to create larger stripes that compress much more efficiently with Snappy, bringing your file size back down to the expected 60GB range.
Hope this helps, Louis.