Hi @Dribka @William_Scardua
import numpy

# Collect the DataFrame to the driver as pandas and measure each column's
# in-memory footprint; deep=True also counts the contents of object/string columns.
actual_size_of_each_columns = df.toPandas().memory_usage(deep=True).to_dict()
del actual_size_of_each_columns["Index"]  # drop the pandas index entry, keep only real columns

for key in actual_size_of_each_columns:
    print(f"Size of the Column `{key}` -> {actual_size_of_each_columns[key]} bytes")

print(f"\nSize of the DataFrame -> {numpy.sum(list(actual_size_of_each_columns.values()))} bytes")
This code can help you find the actual in-memory size of each column and of the whole DataFrame. Because toPandas() pulls the data to the driver, the figures reflect pandas' uncompressed representation, so they are effectively an upper bound; Spark's internal optimizations typically keep the actual footprint smaller.
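Note that df.toPandas() collects the entire DataFrame to the driver, which may not be practical for very large tables. A minimal sketch of one possible workaround is below: measure a sample and extrapolate linearly. The 1% fraction, the seed, and the linear scaling are illustrative assumptions, not part of the original snippet.

# Sketch: estimate per-column size from a sample instead of collecting everything.
# The fraction value and the linear extrapolation are assumptions for illustration.
fraction = 0.01
sample_pdf = df.sample(fraction=fraction, seed=42).toPandas()

sampled_sizes = sample_pdf.memory_usage(deep=True).to_dict()
del sampled_sizes["Index"]  # drop the pandas index entry

for column, size_bytes in sampled_sizes.items():
    estimated = size_bytes / fraction  # scale the sampled size up to the full DataFrame
    print(f"Estimated size of the Column `{column}` -> {estimated:.0f} bytes")

print(f"\nEstimated size of the DataFrame -> {sum(sampled_sizes.values()) / fraction:.0f} bytes")

The estimate assumes the sampled rows are representative of the whole DataFrame; skewed string columns can make it less accurate.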