06-08-2023 12:28 AM
I see.
That is not an easy one.
It depends on how big your data is, whether it will be partitioned, and so on.
If you use partitioning, the cardinality of the partition column decides how many directories are written, and with that the minimum number of output files.
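To get a feel for how quickly the file count can grow with a partitioned write, here is a rough sketch in plain Python. It assumes default writer behaviour, where each in-memory partition (task) writes one file for every partition-column value it contains, and no extra splitting such as a max-records-per-file setting. The function name and the example numbers are made up for illustration.

```python
def max_output_files(distinct_partition_values: int, in_memory_partitions: int) -> int:
    """Worst-case number of files for a partitioned write.

    Assumption: every in-memory partition holds rows for every
    partition-column value, so each task writes one file per value.
    """
    if distinct_partition_values < 1 or in_memory_partitions < 1:
        raise ValueError("both arguments must be at least 1")
    return distinct_partition_values * in_memory_partitions


# Hypothetical example: a date column with 365 distinct values,
# written from a DataFrame that has 200 in-memory partitions.
print(max_output_files(365, 200))  # 73000
```

The lower bound is one file per distinct value, which is what you approach if you repartition on the partition column before writing.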
If your data is very small, a single file might be enough.
So the best way is to explore your data first and get to know its profile.
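With coalesce and repartition you can define the number of partitions (files) that will be written. This has a cost, of course: repartition triggers an extra full shuffle, while coalesce avoids the full shuffle but can only reduce the number of partitions and may lower parallelism.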
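As a starting point for choosing that number, you could derive it from the data size and a target file size. This is only a sketch: the 128 MiB default is an assumption and not a recommendation, the helper name is made up, and the size you measure in memory will differ from the compressed size on disk.

```python
import math


def partitions_for_target_size(total_bytes: int,
                               target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Number of partitions needed so each file is roughly target_file_bytes.

    The 128 MiB default is an illustrative assumption; tune it for your workload.
    """
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_file_bytes > 0")
    # Always write at least one file, even for tiny or empty data.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical example: about 10 GiB of data.
num_files = partitions_for_target_size(10 * 1024**3)
print(num_files)  # 80

# With a Spark DataFrame `df` (not defined here) this would be used like:
#   df.repartition(num_files).write.parquet("/some/output/path")  # full shuffle
#   df.coalesce(num_files).write.parquet("/some/output/path")     # no full shuffle, reduce only
```

The output path and `df` above are placeholders.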
But be aware that there is no single optimal file size.
Delta Lake is also an option; it has some automated file optimizations. Definitely worth trying out.