Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Configuring average parquet file size

f2008700
New Contributor III

I have S3 as a data source containing sample TPC datasets (10 GB and 100 GB).

I want to convert that into Parquet files with an average size of about 256 MiB. Which configuration parameter can I use to set that?

I also need the data to be partitioned, and within each partition the files should be split based on that average size.

If I set `df.coalesce(1)`, it creates only one file per partition directory.

I've also tried setting these parameters (based on a Google search):

```
df.write.option("maxRecordsPerFile", 6000000)
df.write.option("parquet.block.size", 256 * 1024 * 1024)
```

But that didn't help either. Any suggestions?

6 REPLIES

-werners-
Esteemed Contributor III

Hi, you might want to check this topic on Stack Overflow.

`maxPartitionBytes` (the `spark.sql.files.maxPartitionBytes` setting) influences the file size, but it is not a hard constraint (see the SO topic).
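For reference, a minimal sketch of setting it (this assumes a SparkSession named `spark`; the S3 paths are placeholders). The setting caps how many bytes Spark packs into each partition when reading, so the size of the written files follows it only loosely:

```
# Assumption: an existing SparkSession named `spark`; paths are placeholders.
# spark.sql.files.maxPartitionBytes caps the bytes per partition when reading
# file-based sources; the written file sizes follow it only indirectly.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

df = spark.read.parquet("s3://your_bucket/tpc_input_path")
df.write.parquet("s3://your_bucket/parquet_output_path")
```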

f2008700
New Contributor III

Thanks for pointing it out.

My ask is actually the opposite: I want to ensure that the Parquet files are not too small, i.e. that each file is larger than a certain minimum size or contains a minimum number of records.

-werners-
Esteemed Contributor III

I see.

That is not an easy one.

That depends on how big your data is, whether it will be partitioned, and so on.

If you use partitioning, it is the cardinality of the partition column that decides the size of the output files.

If you have very small data, a single file might be enough etc.

So the best way is to first explore your data and get to know its profile.

With coalesce and repartition you can define the number of partitions (files) that will be written (but this has a cost, of course: an extra shuffle).
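For example, a minimal sketch of that approach, assuming the total input size is known or can be estimated (sizes, names, and S3 paths below are placeholders):

```
# A sketch, not a drop-in solution: derive a target file count from the
# (known or estimated) dataset size, then repartition to exactly that many
# partitions so each written file lands near ~256 MiB.
TARGET_FILE_BYTES = 256 * 1024 * 1024
estimated_total_bytes = 10 * 1024 ** 3            # e.g. the 10 GB TPC dataset
num_files = max(1, estimated_total_bytes // TARGET_FILE_BYTES)

df = spark.read.parquet("s3://your_bucket/tpc_input_path")

# repartition() triggers a shuffle; coalesce() avoids one but can only lower
# the partition count and may leave file sizes skewed.
df.repartition(int(num_files)).write.parquet("s3://your_bucket/parquet_output_path")
```

If the output also has to be written with `partitionBy`, the same idea applies per partition value (for example by repartitioning on the partition column plus a salt), since the file count then becomes a per-partition concern rather than a global one.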

But be aware that there is no single optimal file size.

Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.
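For instance, on Databricks a Delta table can be steered toward larger, more uniform files roughly like this (the table name is a placeholder, and exact behaviour depends on your runtime):

```
# A hedged sketch for a Delta table on Databricks: optimized writes and auto
# compaction aim for fewer, larger files, and OPTIMIZE compacts small files
# after the fact. "my_table" is a placeholder name.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
spark.sql("OPTIMIZE my_table")
```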

Anonymous
Not applicable

@Vikas Goel:

Adding more pointers here-

If you want to ensure that the Parquet files have a minimum size or contain a minimum number of records, you can use the `minRecordsPerFile` option when writing the DataFrame to Parquet format. Here's how you can modify your code:

```
# Set the desired minimum number of records per Parquet file
min_records_per_file = 1000

# Estimate the average row size in bytes from a small sample
# (StructType has no size method, so measure serialized sample rows instead;
# this is a rough proxy -- adjust for your data)
sample_rows = df.limit(1000).collect()
avg_row_bytes = max(1, sum(len(str(r)) for r in sample_rows) // max(1, len(sample_rows)))

# Calculate the number of records that roughly fills a 256 MiB file
max_records_per_file = int((256 * 1024 * 1024) / avg_row_bytes)

# Choose the maximum of the desired minimum and the calculated value
records_per_file = max(min_records_per_file, max_records_per_file)

# Write the DataFrame to Parquet with the specified minimum records per file
# (if your Spark version does not recognize "minRecordsPerFile", repartition
# to the desired file count instead, as described above)
df.write.option("minRecordsPerFile", records_per_file).parquet("s3://your_bucket/parquet_output_path")
```

f2008700
New Contributor III

Thanks!

Will give it a try.

Anonymous
Not applicable

Hi @Vikas Goel

We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see whether the suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
