Data Engineering
How do I reduce the size of a Hive table's S3 bucket?

dotan
New Contributor II

I have a Hive table in Delta format with over 1B rows. When I check the Data Explorer in the SQL section of Databricks, it reports the table size as 139.3 GiB with 401 files, but when I check the S3 bucket where the files are located (dbfs:/user/hive/warehouse/large_table), it's over 110 TB and contains over 100K files.

Is it possible to reduce the size of the S3 bucket without losing any data in the table?

ACCEPTED SOLUTION

apingle
Contributor

When you run updates, deletes, etc. on a Delta table, new files are created. However, the old files are not automatically deleted. This is to allow for features like time travel on Delta tables.

To delete the older files for a Delta table, you can use the VACUUM command.

https://docs.databricks.com/sql/language-manual/delta-vacuum.html
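For example, a minimal sketch of what that could look like in a Databricks notebook, assuming the table is registered as large_table (taken from the bucket path in your post) and keeping the default 7-day retention window; both are placeholders to adjust for your setup:

# Check the size and file count the Delta transaction log currently references
spark.sql("DESCRIBE DETAIL large_table").select("sizeInBytes", "numFiles").show()

# Preview which unreferenced files VACUUM would remove (deletes nothing yet)
spark.sql("VACUUM large_table DRY RUN").show(truncate=False)

# Delete unreferenced files older than the retention threshold.
# 168 hours (7 days) is the default; shortening it limits time travel to that window.
spark.sql("VACUUM large_table RETAIN 168 HOURS")

Note that retaining less than the 7-day default requires disabling the retentionDurationCheck safety setting, so sticking with the default is the safer first step.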


2 REPLIES


dotan
New Contributor II

That's great, thanks. It reduced the size of the bucket from 110 TB to 7 TB.
