How to find the actual size of Delta and non-Delta tables, and the number of files that actually exist on S3
07-25-2024 05:18 AM
I have a set of Delta and non-Delta tables whose data is stored on AWS S3. I want to know the actual total size of each Delta and non-Delta table, excluding files that belong to operations such as DELETE, VACUUM, etc. I also need to know how many files each Delta version has; for example, the "operationMetrics" column returned by DESCRIBE HISTORY gives some of these details.
My goal is to set up a retention policy and delete as much S3 data as possible. I have not deleted any data since the start of the project and no retention policy has been applied yet, so storing that much data has become costly.
Please feel free to suggest.
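One possible way to get per-version file counts and sizes, beyond what `operationMetrics` reports, is to replay the commit files under `_delta_log`. Each commit is newline-delimited JSON in which `add` actions carry a `path` and `size` and `remove` actions carry a `path`. The sketch below is only an illustration under stated assumptions: the commit file contents have already been read into strings (for example, downloaded from S3), `replay_versions` and the sample data are hypothetical names, and checkpoint Parquet files are ignored, so the result is complete only if the JSON commits back to version 0 are still available.

```python
import json
from urllib.parse import unquote


def replay_versions(commits):
    """Replay Delta commit files in version order.

    commits: dict mapping version number -> text of that version's
             _delta_log/<version>.json file (newline-delimited JSON).
    Returns a list of (version, live_file_count, live_bytes) tuples.
    """
    live = {}  # path -> size in bytes, for files referenced by the current version
    stats = []
    for version in sorted(commits):
        adds, removes = [], []
        for line in commits[version].splitlines():
            line = line.strip()
            if not line:
                continue
            action = json.loads(line)
            if "add" in action:
                adds.append(action["add"])
            elif "remove" in action:
                removes.append(action["remove"])
        # Apply removes before adds so a file re-added in the same commit stays live.
        for rem in removes:
            live.pop(unquote(rem["path"]), None)
        for add in adds:
            # Paths in the log are URL-encoded and relative to the table root.
            live[unquote(add["path"])] = add.get("size", 0)
        stats.append((version, len(live), sum(live.values())))
    return stats


# Hypothetical sample: version 0 adds two files, version 1 rewrites one of them.
sample = {
    0: '{"add": {"path": "part-0000.parquet", "size": 100}}\n'
       '{"add": {"path": "part-0001.parquet", "size": 200}}',
    1: '{"remove": {"path": "part-0000.parquet"}}\n'
       '{"add": {"path": "part-0002.parquet", "size": 50}}',
}
print(replay_versions(sample))  # [(0, 2, 300), (1, 2, 250)]
```

The last tuple describes the current version, so its byte total is the size of the data the table actually references, as opposed to everything stored under the table prefix.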
- Labels: Delta Lake
08-07-2024 07:45 AM
@Retired_mod Thank you for taking the time to address this issue.
We have observed that while running DESCRIBE HISTORY, there are instances where some Parquet files listed in the '_delta_log' JSON files are not physically present on S3. We need to identify which files are actually present on S3 and ensure they match the entries in the '_delta_log'.
Currently, our goal is to clean up unnecessary files from S3, for both Delta and non-Delta tables. We want to remove files that are not relevant or not used by any process, as they occupy significant space on S3. We have not applied any retention policy in the past 2-3 years.
Could you provide guidance on how to identify which files exist on S3 but are missing from the '_delta_log', and vice versa? Additionally, any advice on safely deleting these redundant files would be greatly appreciated.
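The comparison described here can be sketched as a set difference between the data files the log currently references and the object keys actually listed under the table prefix. This is a minimal sketch, not a Delta Lake API: `compare_log_to_storage` is a hypothetical helper, and it assumes the live paths have already been derived from `_delta_log` and the S3 keys have already been listed and made relative to the table root.

```python
def compare_log_to_storage(live_paths, listed_keys):
    """Compare files referenced by the Delta log with keys found on storage.

    live_paths:  iterable of data-file paths (relative to the table root)
                 referenced by the current table version.
    listed_keys: iterable of object keys (relative to the table root)
                 actually present under the table prefix.
    Returns (missing_on_storage, unreferenced_on_storage) as sorted lists.
    """
    live = set(live_paths)
    # Skip the transaction log itself; only data files are compared.
    present = {k for k in listed_keys if not k.startswith("_delta_log/")}
    missing_on_storage = sorted(live - present)
    unreferenced_on_storage = sorted(present - live)
    return missing_on_storage, unreferenced_on_storage


# Hypothetical inputs for illustration.
missing, unreferenced = compare_log_to_storage(
    ["part-0001.parquet", "part-0002.parquet"],
    ["_delta_log/00000000000000000000.json", "part-0000.parquet", "part-0001.parquet"],
)
print(missing)       # ['part-0002.parquet']
print(unreferenced)  # ['part-0000.parquet']
```

Files in the unreferenced list are only candidates for deletion: they may still be needed for time travel to older versions or belong to a write that has not committed yet. For Delta tables, running `VACUUM` with `DRY RUN` first lists what would be removed without deleting anything.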