Optimized writes are recommended over explicit repartitioning for the reasons below.
* The key advantage of optimized writes is the adaptive shuffle. In a streaming ingest use case where input data rates change over time, the adaptive shuffle adjusts to the incoming data rate across micro-batches. If you have code that calls coalesce(n) or repartition(n) just before writing out your stream, you can remove those lines (see the sketch after this list).
* Databricks dynamically optimizes Spark partition sizes based on the actual data and attempts to write out 128 MB files for each table partition. This is an approximate target and can vary with dataset characteristics.
* Repartitioning on a partition column can produce partitions of varying sizes when the data is skewed, which leads to suboptimal file sizes.
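For illustration, here is a minimal sketch of what this looks like in practice: enable optimized writes (via the auto optimize settings described in the doc linked below) and drop the explicit repartition before the write. Table names, paths, and the streaming DataFrame are placeholders, not part of the original example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: make optimized writes the default for newly created Delta tables
# in this session (setting name per the Databricks auto optimize docs).
spark.conf.set(
    "spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true"
)

# Option 2: enable it on an existing Delta table as a table property
# ("events" is a placeholder table name).
spark.sql("""
  ALTER TABLE events
  SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")

# Before: an explicit repartition just before the streaming write, e.g.
#   events_stream.repartition(200).writeStream...
# After: no coalesce(n)/repartition(n) -- the adaptive shuffle picks the
# partition count per micro-batch and targets ~128 MB output files.
# (events_stream is assumed to be a streaming DataFrame defined elsewhere.)
#
# query = (events_stream.writeStream
#          .format("delta")
#          .option("checkpointLocation", "/tmp/checkpoints/events")
#          .toTable("events"))
```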
The bottom line is that an optimized write is not fundamentally different from a repartition. Simply put, it is a repartition where the number of partitions is chosen adaptively and optimally, on the fly, based on the actual data.
Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-compaction
Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks