Configuring average parquet file size

f2008700
New Contributor III

I have S3 as a data source containing a sample TPC dataset (10G, 100G).

I want to convert it into Parquet files with an average size of roughly 256 MiB. What configuration parameter can I use to set that?

I also need the data to be partitioned, and within each partition the files should be split based on that average size.

If I set the coalesce option `df.coalesce(1)`, then it only creates one file per partition.

I've also tried setting these parameters (based on a Google search):

```
df.write.option("maxRecordsPerFile", 6000000)
df.write.option("parquet.block.size", 256 * 1024 * 1024)
```

But that didn't help either. Any suggestions?

7 REPLIES

-werners-
Esteemed Contributor III

Hi, you might wanna check this topic.

maxPartitionBytes (spark.sql.files.maxPartitionBytes) influences the output file size, but it is not a hard constraint (see the SO topic).
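To make that concrete, here is a minimal sketch of how those knobs are typically set (the paths, partition column, and 256 MiB target below are placeholders, not values from this thread):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read-side split size: Spark scans the source in ~256 MiB chunks, which
# indirectly shapes the size of the files written back out (no hard guarantee).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))

df = spark.read.parquet("s3://source_bucket/tpc_dataset/")  # placeholder path

# maxRecordsPerFile is only an upper bound; it caps file size, never raises it.
(df.write
   .option("maxRecordsPerFile", 6000000)
   .partitionBy("partition_col")  # placeholder column
   .parquet("s3://target_bucket/parquet_output/"))
```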

f2008700
New Contributor III

Thanks for pointing it out.

My ask is actually the opposite. I want to ensure that the Parquet files are not too small: each file should be above a certain minimum size, or contain at least a minimum number of records.

-werners-
Esteemed Contributor III

I see.

That is not an easy one.

That depends on how big your data is, whether it will be partitioned, etc.

If you use partitioning, it is the cardinality of the partition column that decides the size of the output files.

If you have very small data, a single file might be enough etc.

So the best way is to first explore your data, getting to know the data profile.

With coalesce and repartition you can define the number of partitions (files) that will be written. (Repartition has a cost of course, an extra shuffle; coalesce avoids the shuffle but only reduces the partition count.)
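For example, a rough sketch of deriving the partition count from the data volume (reusing `spark` and `df` from the question; the total-size figure and partition column are assumptions to replace with your own):

```
# Aim for ~256 MiB files by deriving a partition count from the input size.
target_bytes = 256 * 1024 * 1024

# Assumption: total size of the source data in bytes (e.g., from an S3 listing
# of the 100G TPC dataset); substitute however you measure it.
total_input_bytes = 100 * 1024 ** 3

num_files = max(1, total_input_bytes // target_bytes)

# Repartition before the write; with partitionBy the per-partition file sizes
# still depend on how evenly the data is spread across the partition column.
(df.repartition(int(num_files))
   .write
   .partitionBy("partition_col")  # placeholder column
   .parquet("s3://target_bucket/parquet_output/"))
```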

But be aware that there is no single optimal file size.

Delta Lake is an option, which has some automated file optimizations. Definitely worth trying out.
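For completeness, a sketch of the Delta route on Databricks (the optimized-write and auto-compaction settings shown here are assumptions to verify against your runtime's documentation):

```
# Let Delta coalesce small files at write time, then bin-pack afterwards.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

(df.write
   .format("delta")
   .partitionBy("partition_col")  # placeholder column
   .save("s3://target_bucket/delta_output/"))

# Compact existing small files into larger ones.
spark.sql("OPTIMIZE delta.`s3://target_bucket/delta_output/`")
```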

Anonymous
Not applicable

@Vikas Goel:

Adding more pointers here-

Spark's writer only exposes an upper bound on records per file (maxRecordsPerFile); there is no built-in minimum-size option. You can approximate a minimum file size by estimating the average row size, working out how many rows fit in ~256 MiB, and repartitioning so each task writes roughly that many rows. Here's a sketch of how you could modify your code:

# Target output file size
target_file_bytes = 256 * 1024 * 1024
 
# Estimate the average row size from a small sample (a rough proxy;
# Parquet's columnar encoding and compression will usually make the
# real files smaller than this estimate).
sample = df.limit(10000).toPandas()
avg_row_bytes = max(1, int(sample.memory_usage(deep=True).sum() / max(1, len(sample))))
 
# Rows that should fit in one ~256 MiB file, and the matching file count
rows_per_file = max(1, target_file_bytes // avg_row_bytes)
num_files = max(1, df.count() // rows_per_file)
 
# Repartition so each output file lands near the target size
(df.repartition(int(num_files))
   .write
   .option("maxRecordsPerFile", int(rows_per_file))
   .parquet("s3://your_bucket/parquet_output_path"))
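One caveat on that sketch: the in-memory estimate overstates what Parquet actually writes after encoding and compression, so treat the computed file count as a starting point and adjust it after inspecting the first output.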

f2008700
New Contributor III

Thanks!

Will give it a try.

Kaniz
Community Manager

Hi @Vikas Goel, did you try it?

Anonymous
Not applicable

Hi @Vikas Goel,

We haven't heard from you since the last response from @Werner Stinckens, and I was checking back to see if the suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
