Data Engineering

How often should I run OPTIMIZE on my Delta Tables?

User16826992666
Valued Contributor

I know it's important to periodically run Optimize on my Delta tables, but how often should I be doing this? Am I supposed to do this after every time I load data?

1 ACCEPTED SOLUTION

Accepted Solutions

sajith_appukutt
Honored Contributor II

It would depend on how frequently you update the table and how often you read it.

If you have a daily ETL job updating a Delta table, it can make sense to run OPTIMIZE at the end of the job so that subsequent reads benefit from the compacted files.
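As a sketch of that pattern, assuming a Databricks SQL environment where the table name and source path are hypothetical:

```sql
-- Hypothetical daily ETL step: load the day's data into the Delta table
COPY INTO sales
FROM '/mnt/raw/sales/'
FILEFORMAT = PARQUET;

-- Then compact small files so subsequent reads scan fewer, larger files
OPTIMIZE sales;
```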

But if you have, for example, a streaming pipeline appending data to a Delta table partitioned by day, you could enable delta.autoOptimize.optimizeWrite = true so that Delta attempts to write out 128 MB files for each table partition. You could then have a separate job scheduled daily to run OPTIMIZE on all data except the current day's partition.
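Optimized writes can be enabled as a table property; a minimal sketch, assuming an existing Delta table named events:

```sql
-- Enable optimized writes so Delta targets larger files per partition on write
ALTER TABLE events
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true);
```

Setting it as a table property (rather than a session setting) keeps the behavior consistent across all writers to the table.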

For example:

OPTIMIZE events
WHERE date < current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Schedule OPTIMIZE jobs so that they don't conflict with other concurrent writes, and so that subsequent reads can benefit from the compaction and Z-ordering.


