
Configuring average parquet file size

f2008700 — Mon, 05 Jun 2023 06:51:00 GMT

I have S3 as a data source containing a sample TPC dataset (10G, 100G).

I want to convert that into Parquet files with an average size of about 256 MiB. What configuration parameter can I use to set that?

I also need the data to be partitioned. And within each partition column, the files should be split based on the average size.

If I set the coalesce option `df.coalesce(1)`, then it only creates 1 file in each partition column.

I've also tried setting the parameters (based off Google search)

```
df.write.option("maxRecordsPerFile", 6000000)
df.write.option("parquet.block.size", 256 * 1024 * 1024)
```

But that didn't help either. Any suggestions?

Re: Configuring average parquet file size

-werners- — Wed, 07 Jun 2023 08:32:01 GMT

Hi, you might wanna check this topic.

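`spark.sql.files.maxPartitionBytes` influences the file size, but it is not a hard constraint (see the SO topic): it caps how many bytes Spark packs into one partition when reading files, so it only shapes the output indirectly.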

Re: Configuring average parquet file size

f2008700 — Wed, 07 Jun 2023 19:19:11 GMT

Thanks for pointing it out.

My ask is actually the opposite. I want to ensure that the Parquet files are not too small: each one should be larger than a certain minimum size or contain a minimum number of records.

Re: Configuring average parquet file size

-werners- — Thu, 08 Jun 2023 07:28:06 GMT

I see.

That is not an easy one.

That depends on how big your data is, if it will be partitioned etc.

If you use partitioning, it will be the cardinality of the partition column which decides the size of the output.

If you have very small data, a single file might be enough etc.

So the best way is to first explore your data, getting to know the data profile.

With coalesce and repartition you can define the number of partitions (files) that will be written. (But this has a cost, of course: repartition adds an extra shuffle, while coalesce avoids the shuffle but can reduce parallelism upstream.)

But be aware that there is no single optimal file size.

Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.
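
To make the repartition suggestion above concrete for partitioned output, here is a minimal sketch, not a definitive recipe. It assumes you already have a rough estimate of the on-disk bytes per partition value (for example from an earlier write). The names `files_per_partition`, `write_partitioned`, and the `_salt` column are hypothetical helpers made up for illustration; only `repartition`, `rand`, `partitionBy`, and `parquet` are Spark calls.

```python
import math

TARGET_FILE_BYTES = 256 * 1024 * 1024  # ~256 MiB, the size asked for in this thread


def files_per_partition(partition_bytes, target_bytes=TARGET_FILE_BYTES):
    """Map each partition value to the number of files needed to approach target_bytes.

    partition_bytes is an assumed input: estimated on-disk bytes per partition value.
    """
    return {key: max(1, math.ceil(size / target_bytes))
            for key, size in partition_bytes.items()}


def write_partitioned(df, partition_col, n_files, path):
    """Write df partitioned by partition_col, with at most n_files files per value."""
    # Imported lazily so files_per_partition can be used without Spark installed.
    from pyspark.sql import functions as F

    # A random salt in [0, n_files) spreads each partition value over up to n_files tasks;
    # each task then writes one file per partition value it holds.
    salted = df.withColumn("_salt", (F.rand() * n_files).cast("int"))
    (salted.repartition(partition_col, "_salt")
           .drop("_salt")
           .write.partitionBy(partition_col)
           .parquet(path))
```

For example, `files_per_partition({"a": 600 * 1024**2, "b": 10 * 1024**2})` returns `{"a": 3, "b": 1}`. Note that `write_partitioned` uses one `n_files` for every partition value, so with very uneven partitions the small ones will still end up with small files.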

Re: Configuring average parquet file size

Anonymous — Sat, 10 Jun 2023 02:37:16 GMT

Hi @Vikas Goel

We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if her suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.

Re: Configuring average parquet file size

Anonymous — Tue, 13 Jun 2023 14:38:21 GMT

@Vikas Goel :

Adding more pointers here-

If you want to ensure that the Parquet files have a minimum size or contain a minimum number of records, note that Spark's DataFrame writer has no `minRecordsPerFile` option; the only record-count option is `maxRecordsPerFile`, which is an upper bound. The usual workaround is to control the number of output partitions, so that each written file ends up near the size you want. Here's how you can modify your code:

```
import math

# Desired average size of each Parquet file
target_file_bytes = 256 * 1024 * 1024

# Assumed on-disk bytes per record; 100 is a placeholder, measure it on a sample of your data
avg_record_bytes = 100

# Approximate number of records that fit in one file of the target size
records_per_file = target_file_bytes // avg_record_bytes

# Number of output files so that each holds roughly records_per_file records
num_files = max(1, math.ceil(df.count() / records_per_file))

# Each of the num_files partitions is written as one file, so fewer partitions means larger files
df.repartition(num_files).write.parquet("s3://your_bucket/parquet_output_path")
```
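
The records-per-file figure above depends on a guessed per-record size. One way to ground it, sketched here under the assumption that you have already written a sample Parquet file and know its size and row count, is to derive the figure from that measurement. The function names `records_for_target` and `num_output_files` are hypothetical and not part of Spark.

```python
import math


def records_for_target(sample_file_bytes, sample_records, target_bytes=256 * 1024 * 1024):
    """Estimate how many records fit in a file of target_bytes, from a measured sample file."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return target_bytes * sample_records // sample_file_bytes


def num_output_files(total_records, records_per_file):
    """Number of files to write so each holds roughly records_per_file records."""
    return max(1, math.ceil(total_records / records_per_file))


# Example: a 64 MiB sample file that holds 2,000,000 records.
per_file = records_for_target(64 * 1024 * 1024, 2_000_000)
print(per_file)                                # 8000000
print(num_output_files(50_000_000, per_file))  # 7
```

The resulting count can be passed to `df.repartition(...)` before the write. Compression and encoding vary with the data, so treat the estimate as a starting point and check the actual file sizes afterwards.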

Re: Configuring average parquet file size

f2008700 — Tue, 13 Jun 2023 21:31:16 GMT

Thanks!

Will give it a try.