Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition Size:

subhas_hati
New Contributor

Hi,

I am using the default partition size of 128 MB. I am reading a 3.8 GB file and checking the number of partitions with df.rdd.getNumPartitions() as shown below, then dividing the file size by the partition count. The resulting partition size is 159 MB.

Why does the partition size after reading the file differ from the 128 MB default?

# Check the default maximum partition size (the value is returned with a trailing "b")
partition_size = spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b", "")
print(f"Partition Size: {partition_size} in bytes and {int(partition_size) / 1024 / 1024} in MB")

# Average partition size in MB (file_size is the file's size in bytes)
partition_size = file_size / 1024 / 1024 / df.rdd.getNumPartitions()
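As a sanity check on the numbers in the question, the partition count you would expect at the 128 MB default, versus the count implied by the observed 159 MB average, can be worked out with plain arithmetic (a sketch using only the figures quoted above; actual counts depend on Spark's split planning):

```python
import math

file_size_mb = 3.8 * 1024          # 3.8 GB file, expressed in MB
max_partition_mb = 128             # spark.sql.files.maxPartitionBytes default

# Partition count you would expect if Spark split strictly at 128 MB
expected_partitions = math.ceil(file_size_mb / max_partition_mb)
print(expected_partitions)                   # 31
print(file_size_mb / expected_partitions)    # ~125.5 MB average

# Partition count implied by the observed 159 MB average
print(round(file_size_mb / 159))             # 24
```

So a 159 MB average means Spark produced roughly 24 partitions instead of the 31 that a strict 128 MB split would give.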

1 REPLY

Sidhant07
Databricks Employee

Hi @subhas_hati ,

The partition size of a 3.8 GB file read into a DataFrame can differ from the 128 MB default, yielding an observed partition size of 159 MB, because of the spark.sql.files.openCostInBytes configuration.

• spark.sql.files.maxPartitionBytes: the maximum number of bytes to pack into a single partition when reading files. The default is 128 MB.

• spark.sql.files.openCostInBytes: an internal configuration that estimates the cost of opening a file, measured as the number of bytes that could be scanned in the same time. Its default is 4 MB, and it is added as a per-file overhead in the partition-size calculation.

When Spark plans a file scan, it adds the spark.sql.files.openCostInBytes overhead to the total bytes to be read, which can produce partitions larger than the spark.sql.files.maxPartitionBytes setting alone would suggest. This is why the observed partition size can be 159 MB instead of the expected 128 MB.

Sources:
1. spark.sql.files.maxPartitionBytes
2. spark.sql.files.openCostInBytes
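How the two settings interact can be sketched after the split-size logic in open-source Spark 3.x (FilePartition.maxSplitBytes); treat this as an approximation of what Databricks runs, and the parallelism value below as an assumption:

```python
def max_split_bytes(total_bytes, num_files, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):    # spark.sql.files.openCostInBytes
    """Approximate Spark's target split size for a file scan.

    Each file is charged open_cost_in_bytes on top of its actual size,
    so scans over many small files end up with fewer, larger partitions.
    """
    bytes_per_core = (total_bytes + num_files * open_cost_in_bytes) / default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# One 3.8 GB file on an assumed 8-core cluster: the per-core share exceeds
# 128 MB, so the split size is capped at maxPartitionBytes.
split = max_split_bytes(total_bytes=int(3.8 * 1024**3), num_files=1,
                        default_parallelism=8)
print(split / 1024 / 1024)    # 128.0
```

The open-cost overhead matters most when reading many small files, where the 4 MB charged per file inflates each partition's accounted size; observed averages also depend on file splittability and compression, which is how they can land above the 128 MB cap.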
