Data Engineering

Spark SQL output multiple small files

Arun_tsr
New Contributor III

We are running multiple joins involving a large table (about 500 GB in size). The output of the joins is written as many small files, each 800 KB to 1.5 MB in size. Because of this, the job is split into many tasks and takes a long time to complete. We have tried Spark tuning configurations such as using a broadcast join, changing the partition size, and changing the maximum records per file, but none of these methods improved performance or fixed the issue. Using coalesce makes the job get stuck at that stage, with no further progress.

Spark UI metrics
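
For reference, here is a minimal sketch of the kind of partition-size tuning described above, assuming Spark 3.x with Adaptive Query Execution (AQE) available. The helper names `small_file_settings` and `apply_settings` and the 128 MB target are illustrative assumptions, not settings taken from the original job. Two caveats: AQE only merges shuffle partitions, so it may have no effect if the final stage has no shuffle (for example, when every join is a broadcast join), and `spark.sql.files.maxRecordsPerFile` only caps the records per file, so it can split files but cannot merge small ones.

```python
def small_file_settings(target_partition_mb=128):
    """Return Spark SQL settings that ask AQE to merge small shuffle partitions.

    The 128 MB default is an assumed target, not a value from the original post.
    """
    target_bytes = target_partition_mb * 1024 * 1024
    return {
        # Turn on Adaptive Query Execution.
        "spark.sql.adaptive.enabled": "true",
        # Let AQE coalesce small shuffle partitions after each shuffle stage.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Advisory size AQE aims for when it coalesces shuffle partitions.
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(target_bytes),
    }


def apply_settings(spark, settings):
    """Apply each setting to an existing SparkSession supplied by the caller."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook this would be called as `apply_settings(spark, small_file_settings())` before running the joins.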

2 REPLIES

Debayan
Esteemed Contributor III

Hi @Arun Balaji, could you please provide the error message you are receiving?

Arun_tsr
New Contributor III

Hi @Debayan Mukherjee, we don't receive any error. The job writes many small files, which increases its runtime. We have not been able to reduce the number of output files with any tuning configuration (we have tried using a broadcast join, changing the partition size, and changing the maximum records per file).
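
One option that is often suggested for this situation, sketched here under stated assumptions, is to repartition the joined result to an explicit partition count just before the write. Each task in the final stage writes at least one file, so the partition count of that stage sets a floor on the number of output files. Unlike `coalesce(n)`, which avoids a shuffle and can therefore pull the upstream join work down to `n` tasks (a possible reason a job appears stuck), `repartition(n)` adds a shuffle, so the joins keep their parallelism. The helper names, the 128 MB target file size, the Parquet format, and the overwrite mode below are all illustrative assumptions, and the output size has to be estimated by the caller.

```python
import math


def target_file_count(output_bytes, target_file_bytes=128 * 1024 * 1024):
    """Estimate how many output files give roughly target_file_bytes each.

    output_bytes is the caller's estimate of the total output size.
    Always returns at least 1 so the result is a valid partition count.
    """
    return max(1, math.ceil(output_bytes / target_file_bytes))


def write_with_fewer_files(joined_df, output_path, output_bytes):
    """Repartition a Spark DataFrame to the estimated count, then write it.

    joined_df is assumed to be the DataFrame produced by the joins.
    Parquet and overwrite mode are assumptions for this sketch.
    """
    num_files = target_file_count(output_bytes)
    # repartition() shuffles the data into num_files partitions;
    # each partition is then written by one task as one file.
    joined_df.repartition(num_files).write.mode("overwrite").parquet(output_path)
    return num_files
```

For example, an estimated 10 GiB of output with the 128 MiB default gives `target_file_count(10 * 1024**3) == 80`, so the write would produce about 80 files instead of thousands of files of around 1 MB each.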
