
Is it required to run OPTIMIZE after doing GDPR DELETEs?

cristianc
Contributor

Greetings,

I have been reading the excellent article at https://docs.databricks.com/security/privacy/gdpr-delta.html?_ga=2.130942095.1400636634.1649068106-1... and my question is this: if GDPR DELETEs are performed on a table and they are the only change, is it required to run OPTIMIZE with ZORDER BY on the table again, or is the Z-ordering maintained?

Thanks in advance for your help,

Cristian

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

After a GDPR DELETE, please run VACUUM.
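
For readers following along, here is a minimal sketch of the delete-then-VACUUM sequence. It only builds the SQL strings; the table name `events`, the column `user_id`, and the helper `gdpr_delete_statements` are hypothetical, and the 168-hour retention is an assumption matching the usual 7-day default. Until VACUUM removes the old files, the deleted rows still exist physically in the table's storage.

```python
def gdpr_delete_statements(table, predicate, retention_hours=168):
    """Build the SQL for a GDPR delete followed by VACUUM.

    `table` and `predicate` are illustrative inputs, not validated here;
    do not pass untrusted strings.
    """
    if retention_hours < 0:
        raise ValueError("retention_hours must be non-negative")
    return [
        # Logically removes the rows; old data files remain on storage.
        f"DELETE FROM {table} WHERE {predicate}",
        # Physically removes files older than the retention window.
        f"VACUUM {table} RETAIN {retention_hours} HOURS",
    ]


statements = gdpr_delete_statements("events", "user_id = 'u-123'")
for statement in statements:
    print(statement)
    # In a Databricks notebook this would typically be: spark.sql(statement)
```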

cristianc
Contributor

@Hubert Dudek thanks for the hint. Exactly as written in the article, VACUUM is required after the GDPR delete operation. However, do we need to run OPTIMIZE with ZORDER BY on the table again, or is the ordering maintained?

Hubert-Dudek
Esteemed Contributor III

No, it is not required: OPTIMIZE affects query performance, not what data is stored. It is an optimization you can run after a delete, but you don't need to do it after every delete. OPTIMIZE can run, for example, once every 24 hours as a nightly maintenance job.

Thanks again for answering.

To understand the context better, imagine a really big table that is very expensive to fully OPTIMIZE with ZORDER BY.

On this table we do GDPR deletes on certain partitions, potentially quite a lot of them. The question is: if the only changes to those partitions are the GDPR deletes, is OPTIMIZE with ZORDER BY still required?

OPTIMIZE is never run on the whole table because of its size; we run it selectively, only on changed partitions. For inserts and updates it is clear that the ordering changes and Z-ordering is needed again, but what about deletes? Is a delete also a type of change that requires OPTIMIZE with ZORDER BY to be executed again, and if so, why?
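
To make the selective approach concrete, here is a rough sketch of how the per-partition statements could be generated. The table `events`, the partition column `event_date`, the Z-order column `user_id`, and the helper `optimize_statements` are all hypothetical names for illustration; the WHERE predicate in OPTIMIZE is assumed to reference partition columns only.

```python
def optimize_statements(table, partition_column, partition_values, zorder_columns):
    """Build one OPTIMIZE ... ZORDER BY statement per changed partition.

    Partition values are treated as strings and quoted naively; this is
    a sketch and does not escape untrusted input.
    """
    if not zorder_columns:
        raise ValueError("at least one ZORDER BY column is required")
    columns = ", ".join(zorder_columns)
    return [
        f"OPTIMIZE {table} WHERE {partition_column} = '{value}' ZORDER BY ({columns})"
        for value in partition_values
    ]


changed_partitions = ["2022-04-01", "2022-04-02"]  # partitions touched by the job
for statement in optimize_statements("events", "event_date", changed_partitions, ["user_id"]):
    print(statement)
    # In a Databricks notebook this would typically be: spark.sql(statement)
```

Whether partitions touched only by deletes belong in `changed_partitions` is exactly the open question here.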

I did some more research, and according to this entry https://docs.databricks.com/release-notes/runtime/10.4.html#insertion-order-tags-are-now-preserved-f... with DBR 10.4 LTS the Z-ordering is kept in some cases:

"The UPDATE and DELETE commands now preserve existing clustering information (including Z-ordering) for files that are updated or deleted. This is a best-effort approach and does not apply to cases when files are so small that they are combined during the update or delete."
