Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to estimate DataFrame size in bytes?

William_Scardua
Valued Contributor

Hi guys,

How do I estimate the size in bytes of my PySpark DataFrame?

Any ideas?

Thank you

3 REPLIES

Dribka
New Contributor III

@William_Scardua you can get a rough estimate of a PySpark DataFrame's size in bytes from its schema. First, retrieve the column data types with df.dtypes. Then map each fixed-width type to its size in bytes (e.g. bigint = 8, int = 4), multiply by the number of rows, and sum across all columns. Variable-width types such as strings need an assumed or sampled average length.

You can also check df.storageLevel to see whether the DataFrame is persisted in memory or on disk, since that affects the actual storage footprint. Keep in mind this is only an estimate; the real memory usage varies with compression and Spark's internal optimizations. For string columns, pyspark.sql.functions.length can help you measure actual per-row sizes instead of guessing an average.
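A minimal sketch of the dtype-based approach described above. The per-type byte sizes and the average string length are assumptions you should tune to your data; only the commented line at the bottom touches an actual Spark DataFrame.

```python
# Rough per-column size estimate from Spark dtype strings.
# The byte sizes below are assumptions for fixed-width Spark SQL types;
# anything unknown (e.g. string) falls back to a guessed average width.
TYPE_SIZES = {"int": 4, "bigint": 8, "float": 4, "double": 8,
              "boolean": 1, "date": 4, "timestamp": 8}
ASSUMED_AVG_WIDTH = 20  # hypothetical average bytes for variable-width columns

def estimate_size_bytes(dtypes, num_rows):
    """dtypes: list of (column_name, type_string), as returned by df.dtypes."""
    total = 0
    for _name, dtype in dtypes:
        total += num_rows * TYPE_SIZES.get(dtype, ASSUMED_AVG_WIDTH)
    return total

# With a real Spark DataFrame this would be:
#   estimate_size_bytes(df.dtypes, df.count())
print(estimate_size_bytes([("id", "bigint"), ("name", "string")], 1000))
```

With one bigint column (8 bytes/row) and one string column (assumed 20 bytes/row) over 1,000 rows, this prints 28000.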

BroData
New Contributor II

Hi @Dribka @William_Scardua 

import numpy

# Note: toPandas() collects the entire DataFrame to the driver, so only
# use this on data that fits in driver memory.
actual_size_of_each_columns = df.toPandas().memory_usage(deep=True).to_dict()
del actual_size_of_each_columns["Index"]
for key in actual_size_of_each_columns:
    print(f"Size of the Column `{key}` -> {actual_size_of_each_columns[key]} bytes")
print(f"\nSize of the DataFrame -> {numpy.sum(list(actual_size_of_each_columns.values()))} bytes")

This code gives you the actual in-memory size of each column and of the whole DataFrame once it has been collected into pandas. Note that it measures pandas' memory usage on the driver, which can differ from the size Spark uses internally.
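If the DataFrame is too large to collect with toPandas(), one workaround is to measure a small sample and scale by the row count. The scaling helper below is plain Python; the commented lines show the hypothetical Spark calls that would feed it.

```python
def scale_sample_size(sample_bytes, sample_rows, total_rows):
    """Extrapolate bytes measured on a sample to the full row count."""
    if sample_rows == 0:
        return 0
    return int(sample_bytes * (total_rows / sample_rows))

# With a real Spark DataFrame this might look like:
#   sample = df.limit(1000).toPandas()
#   sample_bytes = int(sample.memory_usage(deep=True).sum())
#   estimate = scale_sample_size(sample_bytes, len(sample), df.count())
print(scale_sample_size(28_000, 1_000, 1_000_000))  # -> 28000000
```

This assumes the sampled rows are representative of the whole DataFrame; heavily skewed string columns will throw the estimate off.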

