Partition Size:

subhas_hati — Tue, 21 Jan 2025 23:28:04 GMT

I am using the default partition size of 128 MB. I am reading a 3.8 GB file and computing the average partition size from df.rdd.getNumPartitions(), as shown below. The result is 159 MB per partition.

Why does the partition size differ from the configured value after reading the file?

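# Check the configured maximum partition size (the value is returned as a string such as "134217728b")
partition_size = spark.conf.get("spark.sql.files.maxPartitionBytes").replace("b", "")
print(f"Partition Size: {partition_size} in bytes and {int(partition_size) / 1024 / 1024} in MB")

# Average partition size in MB after the read (file_size is the size of the file in bytes)
avg_partition_size_mb = file_size / 1024 / 1024 / df.rdd.getNumPartitions()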

Re: Partition Size:

Sidhant07 — Thu, 30 Jan 2025 08:29:11 GMT

Hi @subhas_hati ,

The 3.8 GB file ends up with an average partition size of 159 MB rather than the default 128 MB because of the influence of the spark.sql.files.openCostInBytes configuration. Two settings are involved:

• spark.sql.files.maxPartitionBytes: This setting specifies the maximum number of bytes to pack into a single partition when reading files. The default is 128 MB.

• spark.sql.files.openCostInBytes: This setting is the estimated cost of opening a file, measured as the number of bytes that could be scanned in the same time. Its default value is 4 MB, and it is added per file as overhead in the partition size calculation.
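To make the interaction between these two settings concrete, here is a minimal sketch of how Spark appears to derive its target split size when planning a file scan. It is plain Python rather than Spark code: the function name max_split_bytes and the default_parallelism parameter are illustrative assumptions, the example cluster size is hypothetical, and the sketch does not cover non-splittable formats such as gzip.

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB):
    """Approximate the target split size used when planning a file scan.

    file_sizes: sizes in bytes of the files being read.
    default_parallelism: assumed stand-in for the cluster's default
        parallelism (or the minimum partition number, if configured).
    """
    # Every file is padded with the open cost before the total is averaged.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # The target is capped by maxPartitionBytes and floored by openCostInBytes.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


# Hypothetical example: a single 3.8 GB file, default parallelism of 8.
file_size = int(3.8 * 1024 * MB)
split = max_split_bytes([file_size], default_parallelism=8)
print(f"Target split size: {split / MB} MB")
```

In this sketch the target is a planning value, not a guarantee about the final average. The number of partitions you actually get, and therefore the average you compute, also depends on whether the file format is splittable and on how file_size was measured.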

The partition size calculation adds the spark.sql.files.openCostInBytes overhead to the total file size, which can lead to partition sizes that differ from the spark.sql.files.maxPartitionBytes setting. This is why the observed partition size can be 159 MB instead of the expected 128 MB.

Sources:
1. spark.sql.files.maxPartitionBytes
2. spark.sql.files.openCostInBytes
