Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How often should I run OPTIMIZE on my Delta Tables?

User16826992666
Valued Contributor

I know it's important to periodically run OPTIMIZE on my Delta tables, but how often should I be doing this? Am I supposed to run it after every data load?

1 ACCEPTED SOLUTION


sajith_appukutt
Honored Contributor II

It would depend on how frequently you update the table and how often you read it.

If you have a daily ETL job updating a Delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads benefit from the performance improvements.
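
As a minimal sketch, assuming a hypothetical daily ETL job loading a table named sales that is frequently filtered on customer_id (both names are placeholders, not from the original question), the job's final step could be:

-- Compact small files and co-locate rows by a commonly filtered column
OPTIMIZE sales
ZORDER BY (customer_id)

ZORDER BY tends to pay off most on a high-cardinality column that appears in query predicates.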

But if you have, for example, a streaming pipeline appending data to a Delta table partitioned by day, you could enable delta.autoOptimize.optimizeWrite = true so that Databricks attempts to write out 128 MB files for each table partition. You could then schedule a separate daily job to run OPTIMIZE on all data except the current day.
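
A minimal sketch of enabling that table property, assuming the same events table used in the example below (Databricks' ALTER TABLE ... SET TBLPROPERTIES syntax):

-- Turn on optimized writes (targeting 128 MB files) for this table
ALTER TABLE events
SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)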

The scheduled job could then run, for example:

OPTIMIZE events
WHERE date < current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

Schedule OPTIMIZE jobs so that they don't conflict with other concurrent writes, and so that subsequent reads benefit from the compaction and Z-Ordering.

