Data Engineering
How do I reduce the size of a Hive table's S3 bucket?

dotan
New Contributor II

I have a Hive table in Delta format with over 1B rows. When I check the Data Explorer in the SQL section of Databricks, it reports the table size as 139.3 GiB with 401 files, but when I check the S3 bucket where the files are located (dbfs:/user/hive/warehouse/large_table), it's over 110 TB and contains over 100K files.

Is it possible to reduce the size of the S3 bucket without losing any data in the table?

ACCEPTED SOLUTION

apingle
Contributor

When you run updates, deletes, etc. on a Delta table, new files are created. However, the old files are not automatically deleted. This is to allow for features like time travel on Delta tables.

To delete the older files for a Delta table, you can use the VACUUM command.

https://docs.databricks.com/sql/language-manual/delta-vacuum.html
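For example, a minimal sketch of what that could look like in a Databricks notebook, assuming the table is registered as large_table (taken from the bucket path in your post) and keeping the default 7-day retention window; both are placeholders to adjust for your setup:

# Check the size and file count the Delta transaction log currently references
spark.sql("DESCRIBE DETAIL large_table").select("sizeInBytes", "numFiles").show()

# Preview which unreferenced files VACUUM would remove (deletes nothing yet)
spark.sql("VACUUM large_table DRY RUN").show(truncate=False)

# Delete unreferenced files older than the retention threshold.
# 168 hours (7 days) is the default; shortening it limits time travel to that window.
spark.sql("VACUUM large_table RETAIN 168 HOURS")

Note that retaining less than the 7-day default requires disabling the retentionDurationCheck safety setting, so sticking with the default is the safer first step.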


2 REPLIES


dotan
New Contributor II

That's great, thanks. It reduced the size of the bucket from 110 TB to 7 TB.
