Databricks Community

Arun_tsr · 11-08-2022

We are running multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in many small files, each 800 KB to 1.5 MB in size. Because of this, the job is split into many tasks and takes a long time to complete. We have tried Spark tuning configurations such as using a broadcast join, changing the partition size, changing the max records per file, etc., but none of these methods improved performance or fixed the issue. Using coalesce makes the job get stuck at that stage with no progress.

Debayan · 11-08-2022

Hi @Arun Balaji, could you please provide the error message you are receiving?

Arun_tsr · 11-09-2022

Hi @Debayan Mukherjee, we don't receive any error. The job writes many small files, which increases its runtime. We can't reduce the number of output files with any tuning configuration (we have tried using a broadcast join, changing the partition size, changing the max records per file, etc.).
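
One approach often suggested for this symptom is to set the number of partitions explicitly right before the write, since Spark writes at least one file per non-empty partition. The sketch below is only an illustration under stated assumptions, not a confirmed fix for this job: the helper names `target_file_count` and `write_compacted`, the 128 MB target file size, the Parquet output format, and the estimated output size are all hypothetical and do not come from the thread. It uses `repartition`, which adds a full shuffle, rather than `coalesce`, which can also lower the parallelism of the upstream stage and might be related to the stall described above.

```python
import math

# Assumed target size per output file; 128 MB is a common choice, not a value from the thread.
DEFAULT_TARGET_FILE_BYTES = 128 * 1024 * 1024


def target_file_count(estimated_output_bytes, target_file_bytes=DEFAULT_TARGET_FILE_BYTES):
    """Return how many partitions (and so roughly how many files) to write."""
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Always write at least one partition, even for an empty or tiny result.
    return max(1, math.ceil(estimated_output_bytes / target_file_bytes))


def write_compacted(df, path, estimated_output_bytes):
    """Hypothetical helper: repartition a Spark DataFrame, then write it as Parquet.

    `df` is expected to be a pyspark.sql.DataFrame holding the join result.
    `estimated_output_bytes` is a caller-supplied estimate of the output size.
    """
    num_files = target_file_count(estimated_output_bytes)
    # repartition() shuffles the data into exactly num_files partitions,
    # so the write produces about num_files files.
    df.repartition(num_files).write.mode("overwrite").parquet(path)
    return num_files
```

For example, if the join output were estimated at 10 GB, `target_file_count(10 * 1024**3)` gives 80, so the write would produce roughly 80 files instead of thousands of files of about 1 MB each.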

Spark SQL output multiple small files
