
optimizeWrite takes too long

svrdragon
New Contributor

Hi,

We have a Spark job that writes data to a Delta table for the last 90 date partitions. We have enabled spark.databricks.delta.autoCompact.enabled and delta.autoOptimize.optimizeWrite. The job takes 50 minutes to complete: the transformation logic takes 12 minutes and optimizeWrite takes 37 minutes. Is there any way to reduce the total job time, given that the output per partition is a 64 MB file?

We are using DBR 12.
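
For reference, a minimal PySpark sketch of this kind of setup; the table, source, and column names (events_by_date, staging_events, event_date) are placeholders rather than details of the actual job:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Session-level auto compaction, as mentioned above.
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Table-level optimized writes; "events_by_date" is a hypothetical table name.
spark.sql("""
  ALTER TABLE events_by_date
  SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")

# Cutoff date for the last 90 date partitions.
cutoff = spark.sql("SELECT date_sub(current_date(), 90) AS d").first()["d"]

# "staging_events" and "event_date" stand in for the job's source and its
# date partition column.
df = spark.table("staging_events").where(f"event_date >= '{cutoff}'")

# Overwrite only the last 90 date partitions of the target table.
(df.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", f"event_date >= '{cutoff}'")
   .saveAsTable("events_by_date"))
```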

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @svrdragon, it's great that you're using Delta Lake features to optimize your Spark job.

 

Let's explore some strategies to potentially reduce the total job time:

 

Optimize Write:

Partitioning:

  • Optimize Write helps most when your Delta Lake partitioned tables see write patterns that generate suboptimal (less than 128 MB) or non-standardized file sizes. Repartitioning DataFrames by the partition column before writing them to disk can also help (see the sketch after these bullets).
  • It is also useful for small-batch SQL commands (e.g., UPDATE, DELETE, MERGE, CREATE TABLE AS SELECT, INSERT INTO) against Delta Lake partitioned tables.
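
A minimal repartitioning sketch, reusing hypothetical names (staging_events, events_by_date, event_date) that stand in for your own source, target, and partition column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table and date partition column.
cutoff = spark.sql("SELECT date_sub(current_date(), 90) AS d").first()["d"]
df = spark.table("staging_events").where(f"event_date >= '{cutoff}'")

# Repartitioning by the date column lines the shuffle up with the table's
# partitioning, so each date partition is written by only a few tasks and the
# extra shuffle performed by optimized writes has less left to do.
(df.repartition("event_date")
   .write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", f"event_date >= '{cutoff}'")
   .saveAsTable("events_by_date"))
```

If each of the 90 date partitions really holds only about 64 MB, optimized writes can at best merge that into a single ~64 MB file per partition, so it is worth checking how many files the plain write produces before paying 37 of the job's 50 minutes for the extra shuffle.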

Streaming Ingestion:

  • If your use case involves streaming data with an append pattern to Delta Lake partitioned tables, the extra write latency introduced by Optimize Write may be tolerable (a sketch of this pattern follows these bullets).
  • Evaluate whether the benefits of reduced file count and optimized file sizes outweigh the additional processing cost during writes.
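
A minimal sketch of that streaming append pattern; the source table, checkpoint path, and target table below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Append a stream into a date-partitioned Delta table. With
# delta.autoOptimize.optimizeWrite set on the target, each micro-batch pays an
# extra shuffle in exchange for fewer, larger files.
(spark.readStream
      .table("staging_events_stream")                   # hypothetical source
      .writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/tmp/checkpoints/events_by_date")  # placeholder path
      .partitionBy("event_date")
      .trigger(availableNow=True)
      .toTable("events_by_date"))                       # hypothetical target
```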

Avoid Optimize Write:

  • If you have non-partitioned tables or a well-defined optimization schedule (for example, a nightly OPTIMIZE job), you might choose to avoid Optimize Write; a sketch of that approach follows these bullets.
  • For large tables with specific read patterns, consider whether the extra write latency is acceptable.
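
A sketch of that alternative: turn the write-time optimizations off for the job's session and compact on a schedule instead. The names are the same placeholders as above; on Databricks the session settings below normally take precedence over the table properties, but verify this on your runtime:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Disable optimized writes and auto compaction for this session so the write
# itself finishes sooner.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "false")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "false")

# ... run the normal write of the last 90 date partitions here ...

# Compact the recent partitions on a schedule (for example, a nightly job)
# instead of paying the compaction cost inside every write.
spark.sql("""
  OPTIMIZE events_by_date
  WHERE event_date >= date_sub(current_date(), 90)
""")
```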

VACUUM:

  • Periodically run VACUUM to remove data files that are no longer referenced by the Delta table; this keeps storage and file-listing overhead under control, although it does not by itself shorten the write step.
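
A minimal sketch, assuming the default 7-day retention period and the same placeholder table name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Delete data files that are no longer referenced by the table and are older
# than the retention threshold (7 days by default).
spark.sql("VACUUM events_by_date")
```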

Remember to monitor the impact of these changes on both write performance and read efficiency. 

 

Adjustments may be necessary based on your specific workload characteristics.

 

Happy optimizing! 🚀


