Configuring average parquet file size

f2008700
New Contributor III

I have S3 as a data source containing a sample TPC dataset (10G, 100G).

I want to convert it into Parquet files with an average size of roughly 256 MiB. What configuration parameter can I use to set that?

I also need the data to be partitioned, and within each partition the files should be split based on that average size.

If I set the coalesce option `df.coalesce(1)`, then it only creates one file per partition.

I've also tried setting these parameters (based on a Google search):

```
df.write.option("maxRecordsPerFile", 6000000)
df.write.option("parquet.block.size", 256 * 1024 * 1024)
```

But that didn't help either. Any suggestions?

7 REPLIES

-werners-
Esteemed Contributor III

Hi, you might wanna check this topic.

maxPartitionBytes (spark.sql.files.maxPartitionBytes) influences the output file size, but it is not a hard constraint (see the SO topic).
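To make that concrete, here is a minimal sketch of how those knobs are typically set (the paths, partition column, and 256 MiB target below are placeholders, not values from this thread):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read-side split size: Spark scans the source in ~256 MiB chunks, which
# indirectly shapes the size of the files written back out (no hard guarantee).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

df = spark.read.parquet("s3://source_bucket/tpc_dataset/")  # placeholder path

# maxRecordsPerFile is only an upper bound; it caps file size, never raises it.
(df.write
   .option("maxRecordsPerFile", 6000000)
   .partitionBy("partition_col")  # placeholder column
   .parquet("s3://target_bucket/parquet_output/"))
```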

f2008700
New Contributor III

Thanks for pointing it out.

My ask is actually the opposite. I want to ensure that the Parquet files are not too small: each file should be above a certain minimum size, or contain at least a minimum number of records.

-werners-
Esteemed Contributor III

I see.

That is not an easy one.

That depends on how big your data is, whether it will be partitioned, etc.

If you use partitioning, it is the cardinality of the partition column that decides the size of the output files.

If you have very small data, a single file might be enough etc.

So the best way is to first explore your data, getting to know the data profile.

With coalesce and repartition you can define the number of partitions (files) that will be written. (Repartition has a cost of course, an extra shuffle; coalesce avoids the shuffle but only reduces the partition count.)
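For example, a rough sketch of deriving the partition count from the data volume (reusing `spark` and `df` from the question; the total-size figure and partition column are assumptions to replace with your own):

```
# Aim for ~256 MiB files by deriving a partition count from the input size.
target_bytes = 256 * 1024 * 1024

# Assumption: total size of the source data in bytes (e.g., from an S3 listing
# of the 100G TPC dataset); substitute however you measure it.
total_input_bytes = 100 * 1024 ** 3

num_files = max(1, total_input_bytes // target_bytes)

# Repartition before the write; with partitionBy the per-partition file sizes
# still depend on how evenly the data is spread across the partition column.
(df.repartition(int(num_files))
   .write
   .partitionBy("partition_col")  # placeholder column
   .parquet("s3://target_bucket/parquet_output/"))
```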

But be aware that there is no single optimal file size.

Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.
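For completeness, a sketch of the Delta route on Databricks (the optimized-write and auto-compaction settings shown here are assumptions to verify against your runtime's documentation):

```
# Let Delta coalesce small files at write time, then bin-pack afterwards.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

(df.write
   .format("delta")
   .partitionBy("partition_col")  # placeholder column
   .save("s3://target_bucket/delta_output/"))

# Compact existing small files into larger ones.
spark.sql("OPTIMIZE delta.`s3://target_bucket/delta_output/`")
```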

Anonymous
Not applicable

@Vikas Goel:

Adding more pointers here-

Spark's writer only exposes an upper bound on records per file (maxRecordsPerFile); there is no built-in minimum-size option. You can approximate a minimum file size by estimating the average row size, working out how many rows fit in ~256 MiB, and repartitioning so each task writes roughly that many rows. Here's a sketch of how you could modify your code:

# Target output file size
target_file_bytes = 256 * 1024 * 1024
 
# Estimate the average row size from a small sample (a rough proxy;
# Parquet's columnar encoding and compression will usually make the
# real files smaller than this estimate).
sample = df.limit(10000).toPandas()
avg_row_bytes = max(1, int(sample.memory_usage(deep=True).sum() / max(1, len(sample))))
 
# Rows that should fit in one ~256 MiB file, and the matching file count
rows_per_file = max(1, target_file_bytes // avg_row_bytes)
num_files = max(1, df.count() // rows_per_file)
 
# Repartition so each output file lands near the target size
(df.repartition(int(num_files))
   .write
   .option("maxRecordsPerFile", int(rows_per_file))
   .parquet("s3://your_bucket/parquet_output_path"))
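One caveat on that sketch: the in-memory estimate overstates what Parquet actually writes after encoding and compression, so treat the computed file count as a starting point and adjust it after inspecting the first output.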

f2008700
New Contributor III

Thanks!

Will give it a try.

Kaniz
Community Manager

Hi @Vikas Goel, did you try it?

Anonymous
Not applicable

Hi @Vikas Goel,

We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if the suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
