AWS S3 External Location Size in Unity Catalog

PrasSabb_97245
New Contributor II

Hi,

I am trying to get the raw size (total size) of a Delta table. I can get the table size from the DeltaTable API, but that gives only the latest version's size. I need to find the actual size the table occupies on S3.
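A rough sketch of the kind of call I mean (the table name is just a placeholder); it only reports the current snapshot's size:

    from delta.tables import DeltaTable

    # detail() returns a one-row DataFrame of table metadata, including
    # sizeInBytes, which covers only the files in the current snapshot.
    detail = DeltaTable.forName(spark, "main.my_schema.my_table").detail()
    print(detail.select("sizeInBytes", "numFiles").first())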

Is there any way to find the S3 size in a Unity Catalog-enabled workspace? I don't want to assign an instance profile solely for this purpose.

Appreciate any inputs / guidance.

 

2 REPLIES

Kaniz
Community Manager

Hi @PrasSabb_97245, Delta tables are a way of storing and processing data in Apache Spark™ using the Delta Lake format. The format keeps the data as Parquet files plus a transaction log, which gives you ACID transactions, schema enforcement, and versioned history (time travel), and makes the tables well suited to handling changing data.

 

One of the benefits of Delta tables is that they can be stored in various cloud storage services, such as Amazon S3, Azure Data Lake Storage, Google Cloud Storage, and more. You can use the cloud_files source in Auto Loader to ingest data from these services into your Delta tables. Auto Loader also integrates with Delta Live Tables (DLT), a framework for building reliable and maintainable data pipelines with Delta tables.
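As a minimal sketch of that ingestion path (the bucket paths, file format, and table name below are placeholders, not anything from this thread), the Python form of the Auto Loader source looks roughly like this:

    # Auto Loader sketch; paths, format, and table name are placeholders.
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
          .load("s3://my-bucket/raw/events"))

    # Write the stream into a Delta table (checkpoint path is a placeholder).
    (df.writeStream
       .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
       .toTable("main.my_schema.events"))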

 

To get the raw size of a delta table in S3, you can use one of the following methods:

  • Query the table's DeltaLog object from a notebook. The DeltaLog contains information about the files and partitions of your Delta table, and its snapshot.sizeInBytes attribute gives the size of your Delta table in bytes (a sketch follows below).

 

This method works for both Python and Scala notebooks.
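A minimal sketch from a Python notebook, using DESCRIBE DETAIL to surface the same current-snapshot figures (the table name is a placeholder):

    # DESCRIBE DETAIL reports current-snapshot metadata for a Delta table,
    # including sizeInBytes, numFiles, and the storage location.
    detail = spark.sql("DESCRIBE DETAIL main.my_schema.my_table").first()
    print(detail["sizeInBytes"], detail["numFiles"], detail["location"])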

 

  • Use the delta-lake-reader package, a Python library that provides an easy way to access and manipulate Delta tables stored in various cloud storage services. You can use its read_delta_table function to read your Delta table from S3 into a Spark DataFrame.

This method works for Python notebooks.

 

I hope this helps you with your task. 

PrasSabb_97245
New Contributor II

Hi Kaniz,

Thank you for your suggestions. As per my understanding, snapshot.sizeInBytes gives only the current snapshot's size, but I am looking for the total size (all versions) of the table on S3.
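Something along these lines is what I am after: recursively summing every object under the table's storage location (a sketch only; the path is a placeholder, and it assumes my identity has READ FILES on the Unity Catalog external location covering that path):

    # Sketch: add up every file under the table's location, which includes
    # data files retained from all versions, not just the current snapshot.
    # Assumes READ FILES on the covering Unity Catalog external location;
    # the path below is a placeholder.
    def dir_size_bytes(path):
        total = 0
        for entry in dbutils.fs.ls(path):
            if entry.isDir():
                total += dir_size_bytes(entry.path)
            else:
                total += entry.size
        return total

    print(dir_size_bytes("s3://my-bucket/path/to/delta_table"))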

 
