Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Size of output data increased to 4x the average size

adhi_databricks
New Contributor III

Hey guys,

We have a Databricks job that dumps data to S3 daily; the average file size is 60 GB and the file format is ORC. One inner join operation was taking more than 3 hours. When we debugged it, we found the join was not auto-broadcast and fell back to a sort-merge join by default, where data skew caused the job to run for more than 3 hours.
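A quick way to confirm that skew is the culprit is to count rows per join key on the large side; in Spark that would be a `groupBy(<key>).count()` ordered descending. The same check can be sketched in plain Python (the key names and counts below are made up purely for illustration):

```python
from collections import Counter

# Hypothetical join keys from the large side of the join;
# in the real job these would come from the ORC data.
keys = ["acct_1"] * 9_000 + ["acct_2"] * 500 + ["acct_3"] * 500

counts = Counter(keys)
total = sum(counts.values())

# A single "hot" key owning most rows is the classic skew signature:
# the task handling that key's partition runs far longer than the rest.
for key, n in counts.most_common(3):
    print(f"{key}: {n} rows ({n / total:.0%})")
```

In a sort-merge join, all rows for one key land in one shuffle partition, so a 90% hot key means one task does 90% of the work while the others sit idle.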
Cluster config: 14.3 LTS, r-fleet.4xlarge driver, and 2-50 r-fleet.2xlarge workers.

After we found that the smaller table was not being broadcast automatically, we explicitly told Spark to broadcast it in the join and dumped the output in ORC format as before. But this time 200 GB of data was dumped daily instead of the 60 GB average.
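For reference, the explicit broadcast described here is usually written with Spark's `broadcast()` hint, e.g. `large_df.join(broadcast(small_df), "key")`. Conceptually, a broadcast hash join ships the small table to every executor as an in-memory hash map and probes it row by row, with no shuffle or sort of the large side. A plain-Python sketch of that idea (all table contents are hypothetical):

```python
# Small table: the broadcast side, built into an in-memory hash map.
small = {"k1": "dim_a", "k2": "dim_b"}

# Large table: streamed row by row, never sorted or shuffled.
large = [("k1", 10), ("k2", 20), ("k1", 30), ("k9", 40)]

# Inner join: keep only rows whose key exists in the broadcast map.
joined = [(k, v, small[k]) for k, v in large if k in small]
print(joined)
```

A design consequence worth noting: because the large side is never sorted, the joined output keeps whatever row order the input arrived in.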

Wondering why the data size has increased so much, because when we compared the older data (60 GB) against the newly dumped 200 GB they still appear identical: the row count is the same and the data values we compared also match.
By default it uses the snappy compression codec to dump the data.
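One thing worth checking in a situation like this is row ordering. A sort-merge join emits rows clustered by the join key, while a broadcast hash join preserves the unsorted input order, and columnar compression in ORC is far more effective on clustered data because long runs of similar values compress well. This is only a hypothesis, but the effect is easy to demonstrate with zlib as a stand-in codec (snappy itself is not in the Python standard library):

```python
import random
import zlib

random.seed(0)

# 100k rows drawn from a small set of repeated values,
# mimicking a low-cardinality column after a join.
values = [random.choice(["alpha", "beta", "gamma", "delta"])
          for _ in range(100_000)]

shuffled = "\n".join(values).encode()           # broadcast-join-like order
clustered = "\n".join(sorted(values)).encode()  # sort-merge-join-like order

shuffled_size = len(zlib.compress(shuffled))
clustered_size = len(zlib.compress(clustered))
print(f"shuffled: {shuffled_size} bytes, clustered: {clustered_size} bytes")
```

If this turns out to be the cause, sorting or clustering by a suitable column before the write (for example with `df.sortWithinPartitions(...)` in PySpark) should shrink the files again without changing a single row.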

If someone has an idea about this, please let me know the reason behind it!

Thanks in advance!!

0 REPLIES
