07-12-2023 04:13 AM
I have a large DataFrame (>1 TB) that I have to save in Parquet format (not Delta for this use case). When I save the DataFrame using .format("parquet"), it results in several Parquet files. I want these files to be a specific size (i.e. no larger than 500 MB each). Is there a way to enforce that?
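07-12-2023 06:10 AM
Let's say the data is about 1 TB (1024 * 1024 MB) and you want the average partition size to be 400 MB. Then you can do:
# 1 TB expressed in MB, divided by the 400 MB target, gives 2621 partitions
(df.repartition(1024 * 1024 // 400)
    .write.mode('overwrite')
    .format('parquet')
    .save('path/to/file'))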
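The partition count above is hard-coded for exactly 1 TB. Here is a minimal sketch of the same arithmetic for an arbitrary size. The helper names `partitions_for_target` and `write_with_target_size` are hypothetical, and `total_bytes` is an estimate you have to supply yourself; because Parquet compresses and encodes the data, the in-memory size and the on-disk size can differ considerably.

```python
MB = 1024 * 1024


def partitions_for_target(total_bytes: int, target_bytes: int) -> int:
    """Partition count so that each partition averages about target_bytes."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes > 0")
    # Integer ceiling division; always use at least one partition.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


def write_with_target_size(df, path, total_bytes, target_bytes):
    """Repartition a Spark DataFrame and write it as Parquet.

    `df` is assumed to be a pyspark.sql.DataFrame; this function is
    defined for illustration and is not called in this snippet.
    """
    n = partitions_for_target(total_bytes, target_bytes)
    df.repartition(n).write.mode("overwrite").format("parquet").save(path)


# 1 TB of data with a 400 MB average target
print(partitions_for_target(1024 * 1024 * MB, 400 * MB))  # 2622
```

Note that this targets an average file size, not a hard upper bound, so individual files can still come out somewhat larger than the target.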
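07-12-2023 06:38 AM
Hi @erigaud, good day!
When you save the data, you can pass the parquet.block.size config as an option. The value is in bytes, and it sets the Parquet row group size, so it influences the file size rather than strictly capping it.
Example:
# 500 * 1024 * 1024 bytes = 500 MB
spark.read.parquet("dbfs:/delta/delta-path/part-xxxx.snappy.parquet").write.mode("overwrite").option("parquet.block.size", 500 * 1024 * 1024).parquet("/tmp/vinay/parquet/blocksize1")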
07-13-2023 09:02 AM
In addition to the solutions above, you can cap the number of records per file, provided you have a rough estimate of how many records it takes to reach 500 MB.
# 1000000 is only an example value; 'path/to/file' is a placeholder output path
df.write.option("maxRecordsPerFile", 1000000).parquet("path/to/file")
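One way to arrive at that rough estimate is to write a small sample first, then scale its row count to the target file size. The sketch below assumes you have already measured the sample's on-disk size and row count yourself. The helper names `max_records_for_target` and `write_capped` are hypothetical, and the sample numbers are made up for illustration.

```python
MB = 1024 * 1024


def max_records_for_target(sample_bytes: int, sample_rows: int,
                           target_bytes: int) -> int:
    """Scale a measured sample (bytes on disk, row count) to a row cap."""
    if sample_bytes <= 0 or sample_rows <= 0 or target_bytes <= 0:
        raise ValueError("all arguments must be positive")
    # rows that fit in target_bytes at the sample's bytes-per-row ratio
    return max(1, target_bytes * sample_rows // sample_bytes)


def write_capped(df, path, max_records):
    """Write a Spark DataFrame as Parquet with a per-file record cap.

    `df` is assumed to be a pyspark.sql.DataFrame; this function is
    defined for illustration and is not called in this snippet.
    """
    (df.write.mode("overwrite")
        .option("maxRecordsPerFile", max_records)
        .parquet(path))


# Made-up sample: 2,000,000 rows took 100 MB on disk; target is 500 MB
print(max_records_for_target(100 * MB, 2_000_000, 500 * MB))  # 10000000
```

The estimate only holds if the sample is representative: rows with very different widths or compressibility will produce files that drift from the target.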