cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Historical Data Clean-up from Silver tables

Azure_Data_Bric
New Contributor III

Hi Everyone,

I need your help/suggestion.

We are using a DLT framework for our ELT process, data is received from the Source to the RAW layer in parquet format. This raw data is loaded to the Bronze layer which acts like a history table. From the BRONZE layer data is loaded to Silver using Merge login(upsert) in Delta format.

As part of cost savings, we cleaned up unwanted/old data from Silver tables(records older than 2017)

we used the delete query to delete these records and we ran VACCUME after clean-up to claim the free space.

we saw very good space savings after this activity.

But a few weeks later we observed above silver table's ADLS space was the same as before the cleanup 

But when I query the table I don't see the deleted records.

 

Can someone please help me what is the reason for the space increase and how to get free space permanently after the cleanup? 

 

sasikumar
6 REPLIES 6

Azure_Data_Bric
New Contributor III

Can someone please help on this

sasikumar

Lakshay
Esteemed Contributor
Esteemed Contributor

Hi @Azure_Data_Bric , Did you run the Vacuum command on the same day as the delete query? Did you run the Vacuum command with default retention period of 7 days? And are you running vacuum command regularly or you have run it only once?

Hi Thanks a lot for the response.

I did run VACCUM with 0-hour retention(VACUUM <table_name> RETAIN 0 HOURS;) immedialty after deletion query.

Also, we have default VACCUM running every day once.

 

Please let me know if you need any more information.

sasikumar

Hi @Azure_Data_Bric , Thank you for responding. Are you also running optimize command before vacuum. If not, I would advise you to so.

Azure_Data_Bric
New Contributor III

Hi,

I see Optimize and VACUUM are running on all tables once per day automatically.

that day when we performed historical deletion, we deleted the data first, and then we just ran VACUUM with zero hour retention. 

After some time Optimize and VACUUM (with 7-day retention) ran automatically which is scheduled daily

after deletion activity space of all these tables is reduced by 80%.

when I checked after a week I saw all the space was occupied again. but when i query the table i don't see deleted records and new data addition also not so much, still i see same space is occupied these tables as before deletion activity.

 

Please help what could be the reason

 

sasikumar

Hi @Azure_Data_Bric , As you mentioned you are not seeing the deleted records, my understanding of the issue is that new data has been loaded in the table since the day the historical data was deleted. It also suggests to me that you are not performing delete operations after that day.

If this is the situation, then the only space we can free up(using vacuum operation) is the space taken up by stale files. Please note that when you deleted the historical data and ran the vacuum, a large amount of space was released because you had deleted lot of the data. 

In addition to this, we can also look for below 2 problems:

1. Over partitioning: If your table is over partitioned, then the optimize will not be able to generate a lot of stale files which the vacuum command clears later.

2. Look for any problem at the cloud provider side if the vacuum is actually deleting the data or not. Or is it facing any issue there.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.