Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Save to parquet with fixed size

erigaud
Honored Contributor

I have a large dataframe (>1 TB) that I have to save in Parquet format (not Delta for this use case). When I save the dataframe using .format("parquet"), it results in several parquet files. I want these files to be a specific size (i.e. not larger than 500 MB each). Is there a way to enforce that?
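(For reference, a minimal sketch of the write described above; the output path is an illustrative placeholder, and a SparkSession plus the dataframe df are assumed to exist. Spark writes one file per output partition, so the resulting file sizes follow the partition sizes.)

df.write.mode("overwrite").format("parquet").save("dbfs:/mnt/output/my_table")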

1 ACCEPTED SOLUTION

dream
Contributor

Let's say you want the average partition size to be 400 MB; then you can do:

(df.repartition(1024 * 1024 // 400)  # ~1 TB expressed in MB (1024 * 1024), divided by a 400 MB target = 2621 partitions
    .write.mode('overwrite')
    .format('parquet')
    .save('path/to/file'))
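If the total size is not known ahead of time, the partition count can be derived from an estimate instead of the hard-coded 1 TB figure. A rough sketch, assuming this runs in a Databricks notebook (so dbutils is available), that the source files sit directly under a single directory, and that input and output compression are comparable; the paths and target size are illustrative:

import math

input_path = "dbfs:/mnt/source/my_table/"   # illustrative source directory
target_mb = 400                             # desired average file size

total_mb = sum(f.size for f in dbutils.fs.ls(input_path)) / (1024 * 1024)
num_partitions = max(1, math.ceil(total_mb / target_mb))

(df.repartition(num_partitions)
    .write.mode('overwrite')
    .format('parquet')
    .save('path/to/file'))

Note that this only controls the average size; repartition spreads rows roughly evenly, so individual files can still land somewhat above or below the target.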

4 REPLIES


Vinay_M_R
Databricks Employee

Hi @erigaud, good day!

Whenever you are saving the data, you could pass the parquet.block.size config as an option:

Example:

# parquet.block.size is specified in bytes, so a 500 MB target is 500 * 1024 * 1024
(spark.read.parquet("dbfs:/delta/delta-path/part-xxxx.snappy.parquet")
    .write.mode("overwrite").option("parquet.block.size", 500 * 1024 * 1024)
    .parquet("/tmp/vinay/parquet/blocksize1"))
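A quick way to sanity-check the output is to list the written files and their sizes (assuming a Databricks notebook, where dbutils is available; the path matches the example above):

for f in dbutils.fs.ls("/tmp/vinay/parquet/blocksize1"):
    if f.name.endswith(".parquet"):
        print(f.name, round(f.size / (1024 * 1024), 1), "MB")

Keep in mind that parquet.block.size sets the Parquet row-group size within a file, so a single large partition can still produce a file bigger than the target, just split into multiple row groups.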

Anonymous
Not applicable

Hi @erigaud 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

Lakshay
Databricks Employee

In addition to the solutions provided above, we can also control this behavior by specifying the maximum number of records per file, if we have a rough estimate of how many records it takes to reach a 500 MB file:

# Illustrative output path; maxRecordsPerFile caps the number of rows per file, not its byte size
df.write.option("maxRecordsPerFile", 1000000).mode("overwrite").parquet("path/to/file")
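If the records-per-file figure is unknown, it can be estimated from an average on-disk row size. A rough sketch, assuming a Databricks notebook (dbutils available), source files sitting directly under one directory, and comparable compression for input and output; the paths and the 500 MB target are illustrative:

target_bytes = 500 * 1024 * 1024
total_input_bytes = sum(f.size for f in dbutils.fs.ls("dbfs:/mnt/source/my_table/"))
bytes_per_row = max(1, total_input_bytes // df.count())
max_records = target_bytes // bytes_per_row

df.write.option("maxRecordsPerFile", max_records).mode("overwrite").parquet("path/to/file")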
