
Delta Optimized Write vs Repartitioning: Which is recommended?

saipujari_spark
Databricks Employee

When streaming to a Delta table, both repartitioning on the partition column and optimized write can help to avoid small files.

Which of the two is recommended: Delta optimized write or repartitioning?

Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks

saipujari_spark
Databricks Employee

Optimized write is recommended over repartitioning, for the following reasons:

* The key part of Optimized Writes is that it is an adaptive shuffle. If you have a streaming ingest use case and input data rates change over time, the adaptive shuffle will adjust itself accordingly to the incoming data rates across micro-batches. If you have code snippets where you coalesce(n) or repartition(n) just before you write out your stream, you can remove those lines.

* Databricks dynamically optimizes Spark partition sizes based on the actual data and attempts to write out 128 MB files for each table partition. This is an approximate size and can vary depending on dataset characteristics.

* Repartitioning on a partition column can produce partitions of very different sizes when the data is skewed, which in turn leads to poorly sized files.
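
As a rough sketch of the first point, a streaming write that relies on optimized write instead of a manual `repartition(n)` might look like this in PySpark. The session setting `spark.databricks.delta.optimizeWrite.enabled` is the one described in the Databricks docs; the helper names, paths, and partition column are hypothetical placeholders, and `spark` and `stream_df` are assumed to be an existing session and streaming DataFrame on a Databricks cluster.

```python
def enable_optimized_write(spark):
    """Turn on Delta optimized write for the current Spark session."""
    # Session-level switch; assumes a Databricks runtime that supports it.
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")


def start_delta_stream(stream_df, table_path, checkpoint_path, partition_col):
    """Start a streaming write to a Delta table without a manual repartition.

    All arguments are illustrative: stream_df is an existing streaming
    DataFrame, and the paths and partition column are up to the caller.
    """
    return (
        stream_df
        # Note: no .repartition(n) or .coalesce(n) before the write;
        # with optimized write enabled, the shuffle is sized adaptively.
        .writeStream
        .format("delta")
        .partitionBy(partition_col)
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

Usage would be roughly `enable_optimized_write(spark)` followed by `start_delta_stream(stream_df, "/tmp/delta/events", "/tmp/checkpoints/events", "event_date")`, where the paths and column name are made up for illustration. Optimized write can also be enabled per table through the `delta.autoOptimize.optimizeWrite` table property rather than per session.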

The bottom line is that optimized write is, under the hood, still a repartition. Simply put, optimized write is a repartition in which the number of partitions is chosen adaptively and on the fly, based on the data.
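
To make the difference concrete, here is a toy model in plain Python. It is not the actual Databricks algorithm, only an illustration under simplifying assumptions: repartitioning on the partition column is modeled as one output file per table partition per batch, and the adaptive approach is modeled as choosing the file count from the data volume with a target of roughly 128 MB. The function names and the sample batch sizes are invented for the example.

```python
import math

TARGET_FILE_MB = 128  # approximate target file size mentioned above


def files_from_key_repartition(partition_sizes_mb):
    """Toy model of repartition(partition_col): one file per table partition."""
    return {key: [size] for key, size in partition_sizes_mb.items()}


def files_from_adaptive_split(partition_sizes_mb, target_mb=TARGET_FILE_MB):
    """Toy model of an adaptive shuffle: file count follows the data volume."""
    result = {}
    for key, size in partition_sizes_mb.items():
        # Choose enough files that each one is at most about target_mb.
        n_files = max(1, math.ceil(size / target_mb))
        result[key] = [size / n_files] * n_files
    return result


# Hypothetical skewed micro-batch: MB of data per table partition.
batch = {"2024-01-01": 1024, "2024-01-02": 64, "2024-01-03": 256}

print(files_from_key_repartition(batch))
print(files_from_adaptive_split(batch))
```

In this model the skewed partition becomes a single 1024 MB file under key-based repartitioning, while the adaptive split breaks it into eight files of 128 MB each and leaves the small partition as one file.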

Reference: https://docs.databricks.com/delta/optimizations/auto-optimize.html#auto-compaction

Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks
