Partition Size:
01-21-2025 03:28 PM
Hi
I have kept the default partition size of 128 MB. I am reading a 3.8 GB file, checking the number of partitions with df.rdd.getNumPartitions(), and deriving the average partition size as shown below. It comes out to about 159 MB.
Why does the partition size differ from the default after reading the file?
# Check the configured maximum partition size (default 128 MB)
partition_size = spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b", "")
print(f"Partition Size: {partition_size} in bytes and {int(partition_size) / 1024 / 1024} in MB")
# Average partition size of the loaded DataFrame in MB (file_size holds the file size in bytes)
partition_size = file_size / 1024 / 1024 / df.rdd.getNumPartitions()
print(f"Average partition size: {partition_size} MB")
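For context, file_size is not defined in the snippet; a minimal sketch of one way it could be obtained on Databricks follows (the path is hypothetical):
# Hypothetical path; dbutils.fs.ls returns FileInfo entries that carry a size attribute in bytes
file_size = sum(f.size for f in dbutils.fs.ls("/mnt/data/large_input.parquet"))
print(f"File size: {file_size / 1024 / 1024:.1f} MB")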
01-30-2025 12:29 AM
Hi @subhas_hati ,
The partition size you observe for the 3.8 GB file differs from the 128 MB default, coming out at roughly 159 MB, because of the spark.sql.files.openCostInBytes configuration. Two settings are involved:
• spark.sql.files.maxPartitionBytes: the maximum number of bytes to pack into a single partition when reading files. The default is 128 MB.
• spark.sql.files.openCostInBytes: an internal setting that estimates the cost of opening a file, measured as the number of bytes that could be scanned in the same time. The default is 4 MB, and it is added as an overhead in the partition size calculation.
Because the openCostInBytes overhead is added on top of the total file size, the resulting partition sizes can be larger than the spark.sql.files.maxPartitionBytes default, which is why you see about 159 MB instead of the expected 128 MB. A simplified sketch of the calculation is shown below.
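The sketch below models the split-size calculation in Python, following the open-source FilePartition.maxSplitBytes logic; the function name, the assumed parallelism of 8, and the hard-coded 3.8 GB figure are illustrative assumptions rather than values read from your cluster.
def estimated_max_split_bytes(total_file_bytes, num_files,
                              max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                              open_cost_in_bytes=4 * 1024 * 1024,     # spark.sql.files.openCostInBytes
                              default_parallelism=8):                 # assumed spark.default.parallelism
    # Each file is padded with the open cost before the total is spread across
    # cores; this padding is the overhead described above. The result is then
    # capped at maxPartitionBytes and floored at openCostInBytes.
    padded_total = total_file_bytes + num_files * open_cost_in_bytes
    bytes_per_core = padded_total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

split_target = estimated_max_split_bytes(total_file_bytes=int(3.8 * 1024 ** 3), num_files=1)
print(f"Target split size: {split_target / 1024 / 1024:.1f} MB")
The actual partition boundaries also depend on how the splits line up with the file's internal layout, so the average size computed from getNumPartitions() can differ from this target.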
Sources:
1. spark.sql.files.maxPartitionBytes
2. spark.sql.files.openCostInBytes

