Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to find the actual size of Delta and non-Delta tables, and the number of files that actually exist on S3

Shiva3
New Contributor III

I have a set of Delta and non-Delta tables whose data is stored on AWS S3. I want to know the actual total size of each table, excluding files left behind by operations such as DELETE and VACUUM. I also need to know how many files each Delta version contains; running DESCRIBE HISTORY gives some of this in the "operationMetrics" column, but not the full picture.

My goal is to set up a retention policy and delete as much S3 data as possible. We have not deleted any data since the start of the project and no retention policy has been applied, so storing this much data has become costly.
Please feel free to suggest.

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Shiva3, To manage the size of Delta and non-Delta tables on AWS S3, excluding irrelevant files, start by using `DESCRIBE HISTORY` to monitor Delta table metrics and `VACUUM` to clean up old files, setting a retention period as needed. For non-Delta tables, leverage AWS S3 inventory reports or S3 Select to calculate file sizes. Set up S3 lifecycle policies for automated data management and monitor storage usage with AWS CloudWatch and Databricks metrics. Optimize Delta tables by compacting files with the `OPTIMIZE` command to ensure efficient storage and cost management. Let me know if you need any more help or if there's anything else you'd like to discuss!
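To complement the commands above: the "live" size of a Delta table (what an active reader would actually load, excluding files already tombstoned by DELETE/OPTIMIZE but not yet vacuumed) can be derived by replaying the `add`/`remove` actions in the `_delta_log` JSON commits. This is a minimal sketch assuming you have the JSON commit files locally; it ignores checkpoint files and log cleanup, so it only works while the full JSON history is present. On Databricks, `DESCRIBE DETAIL <table>` reports `sizeInBytes` and `numFiles` for the current snapshot directly, which is the simpler route when a cluster is handy.

```python
import json
from pathlib import Path

def live_delta_size(delta_log_dir):
    """Replay add/remove actions from _delta_log/*.json commit files and
    return (total_bytes, file_count) for data files that are still active.

    Simplification: assumes all JSON commits since table creation are
    present (no checkpoint-only history after log cleanup)."""
    active = {}  # data file path -> size in bytes
    for commit in sorted(Path(delta_log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                # Tombstoned: still on S3 until VACUUM, but not "live"
                active.pop(action["remove"]["path"], None)
    return sum(active.values()), len(active)
```

The gap between this live size and the total bytes under the table's S3 prefix is roughly what VACUUM (after the retention window) would reclaim.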

Shiva3
New Contributor III

@Kaniz_Fatma Thank you for taking the time to address this issue.

We have observed that while running DESCRIBE HISTORY, there are instances where some Parquet files listed in the '_delta_log' JSON files are not physically present on S3. We need to identify which files are actually present on S3 and ensure they match the entries in the '_delta_log'.

Currently, our goal is to clean up unnecessary files from S3 for both Delta and non-Delta tables. We want to remove files that are not relevant to, or used by, any process, as they occupy significant space on S3. We have not applied any retention policy for the past 2-3 years.

Could you provide guidance on how to identify which files exist on S3 but are missing from the '_delta_log', and vice versa? Additionally, any advice on safely deleting these redundant files would be greatly appreciated.
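The log-vs-storage comparison asked about here can be sketched as a set reconciliation: collect every path the log has ever referenced, track which are still active, and diff against an actual object listing. This is a hedged sketch, not an official tool: the function name is made up, the listing would in practice come from a paginated boto3 `list_objects_v2` call or an S3 Inventory report, and (as above) it assumes the full JSON commit history is available. For log entries whose files are already missing from storage, Databricks provides `FSCK REPAIR TABLE`, which is the supported way to drop those dangling entries; for tombstoned-but-unreferenced files, VACUUM remains the safe deletion path rather than deleting objects by hand.

```python
import json
from pathlib import Path

def reconcile_delta_log(delta_log_dir, s3_keys):
    """Compare data files referenced by _delta_log/*.json commits against an
    actual object listing (s3_keys: iterable of keys relative to the table root).

    Returns (missing_from_s3, orphaned_on_s3):
      - missing_from_s3: paths still active in the log but absent from the
        listing (these break reads; candidates for FSCK REPAIR TABLE)
      - orphaned_on_s3: objects no log entry references at all (likely
        leftovers; normally reclaimed via VACUUM, not manual deletes)"""
    active, referenced = set(), set()
    for commit in sorted(Path(delta_log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
                referenced.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
                referenced.add(action["remove"]["path"])
    listed = set(s3_keys)
    return sorted(active - listed), sorted(listed - referenced)
```

Note that files appearing only in `remove` actions are deliberately excluded from the orphan list: they are still tracked by the log and are VACUUM's job to delete once the retention window passes.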
