Is it required to run OPTIMIZE after doing GDPR DELETEs?

cristianc
Contributor

Greetings,

I have been reading the excellent article at https://docs.databricks.com/security/privacy/gdpr-delta.html?_ga=2.130942095.1400636634.1649068106-1... and my question is this: if GDPR DELETEs are the only change made to a table, is it required to run OPTIMIZE with ZORDER BY on the table again, or is the Z-ordering maintained?

Thanks in advance for your help,

Cristian

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

After a GDPR DELETE, please run VACUUM.

cristianc
Contributor

@Hubert Dudek thanks for the hint. Exactly as written in the article, VACUUM is required after the GDPR delete operation. However, do we need to run OPTIMIZE with ZORDER BY on the table again, or is the ordering maintained?

Hubert-Dudek
Esteemed Contributor III

No. OPTIMIZE is not related to how the data is stored or removed, only to performance. It is an optimization you can apply after deletes, but you don't need to do it after every delete. OPTIMIZE can run, for example, once every 24 hours as a nightly maintenance job.
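A minimal sketch of what such a nightly maintenance job might submit, assuming a hypothetical table `events` Z-ordered by a hypothetical column `user_id`. The helper name `nightly_maintenance_sql` is illustrative and not part of any Databricks API; it only builds the SQL text. Running OPTIMIZE before VACUUM and the 168-hour retention (the 7-day default) are assumptions; GDPR deadlines may call for a different retention.

```python
from typing import List, Sequence


def nightly_maintenance_sql(
    table: str,
    zorder_cols: Sequence[str],
    retain_hours: int = 168,  # assumption: 7 days, the usual default retention
) -> List[str]:
    """Build the SQL statements for a nightly Delta maintenance run.

    Hypothetical helper: it only returns strings and does not execute anything.
    """
    if not zorder_cols:
        raise ValueError("at least one Z-order column is required")
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")

    cols = ", ".join(zorder_cols)
    return [
        # Compact files and re-cluster them by the Z-order columns.
        f"OPTIMIZE {table} ZORDER BY ({cols})",
        # Physically remove files no longer referenced by the table,
        # which is what actually erases the deleted rows from storage.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


if __name__ == "__main__":
    for statement in nightly_maintenance_sql("events", ["user_id"]):
        print(statement)
```

In a Databricks notebook, each returned string could be passed to `spark.sql(...)` from a scheduled job.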

Thanks again for answering.

To make the context clearer, imagine a really big table that is very expensive to fully OPTIMIZE with ZORDER BY.

On this table we do GDPR deletes on certain partitions, potentially quite a lot of them. The question is: if the only changes to those partitions are the GDPR deletes, is OPTIMIZE with ZORDER BY still required?

Optimization is never run on the whole table because of its size; we run it selectively, only on changed partitions. For inserts and updates it is clear that the order changes and Z-ordering is needed again, but what about deletes? Is a delete also a type of change that requires OPTIMIZE with ZORDER BY to be executed again, and if so, why?
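The selective approach described above can be sketched as follows, assuming a hypothetical table `events` partitioned by a hypothetical column `date` and Z-ordered by `user_id`. The helper `optimize_changed_partitions_sql` is illustrative only; it emits one `OPTIMIZE ... WHERE ... ZORDER BY` statement per changed partition, relying on the fact that the `WHERE` clause of OPTIMIZE accepts predicates on partition columns. Partition values are restricted to a conservative character set rather than escaped, which is a simplification for this sketch.

```python
import re
from typing import Iterable, List, Sequence

# Assumption for this sketch: partition values are simple tokens such as
# dates or identifiers, so anything else is rejected instead of escaped.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_.:-]+$")


def optimize_changed_partitions_sql(
    table: str,
    partition_col: str,
    changed_values: Iterable[str],
    zorder_cols: Sequence[str],
) -> List[str]:
    """Build one OPTIMIZE statement per changed partition (hypothetical helper)."""
    if not zorder_cols:
        raise ValueError("at least one Z-order column is required")

    cols = ", ".join(zorder_cols)
    statements = []
    # De-duplicate and sort so the output is deterministic.
    for value in sorted(set(changed_values)):
        if not _SAFE_VALUE.match(value):
            raise ValueError(f"unsupported partition value: {value!r}")
        statements.append(
            f"OPTIMIZE {table} WHERE {partition_col} = '{value}' ZORDER BY ({cols})"
        )
    return statements


if __name__ == "__main__":
    touched = ["2022-04-02", "2022-04-01", "2022-04-02"]
    for statement in optimize_changed_partitions_sql("events", "date", touched, ["user_id"]):
        print(statement)
```

Whether partitions touched only by deletes need to be in the `changed_values` list at all is exactly the open question in this thread; the sketch leaves that decision to the caller.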

I did some more research, and according to this entry https://docs.databricks.com/release-notes/runtime/10.4.html#insertion-order-tags-are-now-preserved-f... with DBR 10.4 LTS the Z-order is preserved in some cases:

"The UPDATE and DELETE commands now preserve existing clustering information (including Z-ordering) for files that are updated or deleted. This is a best-effort approach and does not apply to cases when files are so small that they are combined during the update or delete."
