07-12-2023 04:13 AM
I have a large dataframe (>1 TB) that I have to save in Parquet format (not Delta for this use case). When I save the dataframe using .format("parquet"), it results in several parquet files. I want these files to be a specific size (i.e. not larger than 500 MB each). Is there a way to enforce that?
Accepted Solutions
07-12-2023 06:10 AM
Let's say you want the average partition (and hence output file) size to be about 400 MB. Since the dataframe is roughly 1 TB (1024 * 1024 MB), you can repartition into 1024 * 1024 / 400 ≈ 2621 partitions before writing:
(df.repartition(1024 * 1024 // 400)
.write.mode('overwrite')
.format('parquet')
.save('path/to/file'))
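For reference, here is the same idea as a small sketch with the arithmetic spelled out; the ~1 TB total comes from the question, while the 400 MB target and the output path are placeholders to adjust for your own data:
total_size_mb = 1024 * 1024                        # ~1 TB, per the question
target_file_mb = 400                               # desired average file size (assumption)
num_partitions = total_size_mb // target_file_mb   # ≈ 2621 output files

(df.repartition(num_partitions)
 .write.mode('overwrite')
 .format('parquet')
 .save('path/to/file'))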
07-12-2023 06:38 AM
Hi @erigaud, good day!
When saving the data, you can pass the parquet.block.size config as a write option. Note that this setting is specified in bytes and controls the Parquet row group size, so for a ~500 MB target you would pass 500 * 1024 * 1024.
Example:
(spark.read.parquet("dbfs:/delta/delta-path/part-xxxx.snappy.parquet")
    .write.mode("overwrite")
    .option("parquet.block.size", 500 * 1024 * 1024)  # row group size in bytes (~500 MB)
    .parquet("/tmp/vinay/parquet/blocksize1"))

07-12-2023 11:44 PM
Hi @erigaud
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
07-13-2023 09:02 AM
In addition to the solutions provided above, you can also control file size by capping the number of records written to each file, provided you have a rough estimate of how many records add up to about 500 MB (a sketch of deriving that estimate follows the example):
(df.write.option("maxRecordsPerFile", 1000000)
    .format("parquet").save("path/to/output"))
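As a rough sketch (the average row size and output path below are assumptions, not values from the thread), the record cap can be derived from a target file size and an estimated serialized row size:
target_file_bytes = 500 * 1024 * 1024        # ~500 MB per file
avg_row_bytes = 200                          # assumption -- estimate from your own data
records_per_file = target_file_bytes // avg_row_bytes

(df.write
   .option("maxRecordsPerFile", records_per_file)
   .format("parquet")
   .save("path/to/output"))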

