
How to remove specific data from bronze layer

Nisha_Aggarwal
New Contributor II

Hello Team,

At our end, we bring data in from Kafka, ingest it into the Bronze layer with Auto Loader, and promote it onward to the Silver and Silver-plus layers. Due to recent business changes, we need to delete specific data from the Bronze and Silver layers to comply with data protection and governance requirements. How can I delete/update data in Bronze, especially when it is ingested in streaming mode? Could you suggest an approach?

2 REPLIES

holly
Valued Contributor II

Assuming this is an ongoing request, there are a few ways you can tackle it:

Anti joins - you'll need to keep a table of identifiers whose associated data shouldn't be stored. Before writing out each batch, you can do an `anti join` against that table so only the remaining rows are kept. This is great if you can broadcast the table; if not, only apply it to the silver table, as it will slow down streams that are hard to recover/retry if they crash.
Also consider how this ID table will get updated. If your table gets updated hourly, you'll need to refresh the broadcast table with the same frequency (a sketch follows below).

Using hashes if your ID is PII - let's say your ID is a credit card number. You (arguably) can't keep a table of credit card numbers, because that table is itself PII. Instead, store a hash of those values and do the anti join on the hashes.

Assuming you're retroactively dropping data:

Delta's ACID compliance and optimistic concurrency mean you should be able to drop values without rewriting whole files, especially if you're using deletion vectors. Keep in mind the underlying parquet files will still exist until vacuum is run (manually or through Predictive Optimisation), so the data isn't truly deleted until that step. You can use the MERGE INTO syntax to make this happen.

Both

You could use both of these approaches together as a two-step process, but make sure you're not trying to update the same parquet files at the same time. I.e., catch the first set of rows with an anti join, then a day later delete any that slipped through due to operational lag in the lookup table.

This can be a tricky thing to design, and depending on your operational requirements it's going to look different for everyone. 'Compliance within x days' is the easiest end of the spectrum; COPPA adherence, by contrast, can be very strict. I highly recommend drawing it all out before starting any coding. I did this with a customer team and it took us two days of workshops to be happy with the design.

Nisha_Aggarwal
New Contributor II

Hello,

Thank you for your reply to my query!

My bronze layer holds data in JSON format, and I currently need to remove 400 records from it. I also have a job set up in streaming mode. Could you please suggest how I can proceed?
