06-04-2023 11:51 PM
I have S3 as a data source containing a sample TPC dataset (10G, 100G).
I want to convert it into Parquet files with an average size of roughly 256 MiB. Which configuration parameter can I use to set that?
I also need the data to be partitioned, and within each partition directory the files should be split based on that target size.
If I set the coalesce option `df.coalesce(1)`, it only creates one file per partition directory.
I've also tried setting these parameters (based on a Google search):
```
df.write.option("maxRecordsPerFile", 6000000)
df.write.option("parquet.block.size", 256 * 1024 * 1024)
```
But that didn't help either. Any suggestions?
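For reference, each `df.write` starts a fresh DataFrameWriter, so options set in separate statements don't accumulate, and nothing is written until an action such as `.parquet()` runs. Below is a minimal sketch of a single chained, partitioned write; the `sale_date` partition column and the output path are illustrative assumptions, not names from the actual dataset.
```
(
    df.write
      .option("maxRecordsPerFile", 6000000)              # upper bound on rows per output file
      .option("parquet.block.size", 256 * 1024 * 1024)   # Parquet row-group size hint
      .partitionBy("sale_date")                          # hypothetical partition column
      .mode("overwrite")
      .parquet("s3://your_bucket/tpc_parquet/")          # the action that actually writes files
)
```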
06-07-2023 01:32 AM
Hi, you might want to check this topic.
`spark.sql.files.maxPartitionBytes` influences the output file size (it controls how much input data each task reads), but it is not a hard constraint (see the SO topic).
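A minimal sketch of setting it, assuming a 256 MiB target and illustrative paths; since it is a read-side setting, it only shapes output file sizes when the data is written without a shuffle in between:
```
# Soft target for how many bytes each read task gets (and, without a shuffle,
# roughly how much each task writes). Not a hard limit on output file size.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

df = spark.read.parquet("s3://your_bucket/tpc_source/")   # illustrative source path
df.write.mode("overwrite").parquet("s3://your_bucket/tpc_parquet/")
```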
06-07-2023 12:19 PM
Thanks for pointing it out.
My ask is actually the opposite. I want to ensure that the Parquet files are not too small: each file should be larger than a certain base size, or contain a minimum number of records.
06-08-2023 12:28 AM
I see.
That is not an easy one.
It depends on how big your data is, whether it will be partitioned, etc.
If you use partitioning, it is the cardinality of the partition column that decides the size of the output files.
If you have very small data, a single file might be enough, etc.
So the best approach is to explore your data first and get to know its profile.
With coalesce and repartition you can define the number of partitions (and thus files) that will be written, but this has a cost: repartition adds an extra shuffle, and coalesce reduces parallelism.
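For example, a rough sketch of sizing the output by file count; the 10 GiB total and the output path are illustrative assumptions:
```
# Pick a file count from an estimated dataset size and a ~256 MiB target,
# then repartition before writing (this is the extra shuffle mentioned above).
target_file_bytes = 256 * 1024 * 1024
estimated_total_bytes = 10 * 1024 ** 3          # e.g. the 10G TPC run
num_files = max(1, int(estimated_total_bytes // target_file_bytes))

df.repartition(num_files).write.mode("overwrite").parquet("s3://your_bucket/tpc_parquet/")
```
Note that when this is combined with partitionBy, each shuffle partition can still write a file into every partition directory it holds rows for, so per-directory file sizes also depend on the skew of the partition column.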
But be aware that there is no single optimal file size.
Delta Lake is an option that has some automated file optimizations. Definitely worth trying out.
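A minimal sketch using the open-source delta-spark package; the table path and the partition column are illustrative assumptions:
```
from delta.tables import DeltaTable

# Write the data as a Delta table (partition column is a hypothetical example).
(
    df.write
      .format("delta")
      .partitionBy("sale_date")
      .mode("overwrite")
      .save("s3://your_bucket/tpc_delta/")
)

# Compact small files into larger ones via Delta's bin-packing OPTIMIZE.
DeltaTable.forPath(spark, "s3://your_bucket/tpc_delta/").optimize().executeCompaction()
```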
06-13-2023 07:38 AM
@Vikas Goel:
Adding more pointers here:
If you want to ensure that the Parquet files have a minimum size or contain a minimum number of records, you can use the minRecordsPerFile option when writing the DataFrame to Parquet format. Here's how you can modify your code:
```
# Set the desired minimum number of records per Parquet file
min_records_per_file = 1000

# Rough per-row size estimate in bytes (see the sizing sketch below)
avg_row_size_bytes = 100

# Calculate how many records are needed to reach roughly 256 MiB per file
max_records_per_file = int((256 * 1024 * 1024) / avg_row_size_bytes)

# Choose the larger of the desired minimum and the calculated value
records_per_file = max(min_records_per_file, max_records_per_file)

# Write the DataFrame to Parquet format with the specified minimum records per file
df.write.option("minRecordsPerFile", records_per_file).parquet("s3://your_bucket/parquet_output_path")
```
06-13-2023 02:31 PM
Thanks!
Will give it a try.
06-09-2023 07:37 PM
Hi @Vikas Goel,
We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if their suggestions helped you.
Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.
Also, please don't forget to click on the "Select As Best" button whenever the information provided helps resolve your question.