Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Is there a way to permanenetly purge data in Databricks based on certain condition ?
Like, from a particular Databricks table, I want to permanently purge certain rows based on a specific condition e.g., WHERE <col1>="Val1" and <col2>="Val2"
I will assume you are talking about managed tables in Unity Catalog here, if thats not the case, let me know.
We can segregate this in two steps:
You can use a DELETE FROM SQL statement to remove rows that match your condition. For example:
DELETE FROM table_name WHERE col1 = 'Val1' AND col2 = 'Val2'
After running the delete command, the data files are retained a retention period (default is 7 days) to support time travel and rollback. In order to permanently purge the data you can run the VACUUM command. For example:
VACUUM table_name RETAIN 0 HOURS
This physically deletes unreferenced files immediately. CAUTION: Setting retention to zero hours disables time travel for those files and should only be used if you're certain the data must be irrecoverable
Worth mentioning: If you have Deletion Vectors enabled in the table, you may need to run a REORG TABLE command.