Size of output data increased to 4x the average size.
Hey guys,
We have a Databricks job that dumps data into S3 daily; the average output size is about 60GB and the file format is ORC. One inner join operation was taking more than 3 hours. When we debugged it, the join was not being auto-broadcasted and Spark was doing a sort-merge join by default, where data skew caused the job to run for more than 3 hours.
Cluster config: 14.3 LTS, r-fleet.4xlarge driver and 2-50 r-fleet.2xlarge workers.
After we found the smaller table was not being broadcast automatically, we explicitly added a broadcast hint on the join and wrote the output in ORC format as before (see the sketch below). But this time around 200GB of data was dumped per day instead of the usual ~60GB.
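To make the change concrete, here is a simplified sketch of what the job does now. Bucket paths, table names, and the join key are placeholders, not our real ones:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Placeholder paths; the real job reads much larger daily partitions
large_df = spark.read.orc("s3://my-bucket/large_table/")   # ~60GB per day
small_df = spark.read.orc("s3://my-bucket/small_table/")   # small lookup table

# Explicit broadcast hint, since the join was not auto-broadcasted
joined = large_df.join(broadcast(small_df), on="join_key", how="inner")

# Write back to S3 as ORC (snappy compression by default)
joined.write.mode("overwrite").orc("s3://my-bucket/output/")
```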
We're wondering why the data size has increased so much, especially since when we compared the older data (60GB) with the newly dumped 200GB they look the same: the row count matches and the data values match as well.
By default it's using the snappy compression codec to dump the data; we haven't changed any compression settings (see the snippet below).
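For reference, this is roughly how we checked the codec and how it could be overridden if needed; the output path is again a placeholder:

```python
# Spark's default ORC codec is snappy; confirm the current setting
print(spark.conf.get("spark.sql.orc.compression.codec"))  # prints "snappy" unless overridden

# Example of forcing a different codec on write (we have not done this in the job)
joined.write.mode("overwrite").option("compression", "zlib").orc("s3://my-bucket/output_zlib/")
```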
If someone has an idea on this, please let me know the reason behind it!
Thanks in advance!!

