Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Save to parquet with fixed size

erigaud
Honored Contributor

I have a large dataframe (>1 TB) that I have to save in Parquet format (not Delta for this use case). When I save the dataframe using .format("parquet"), it results in several parquet files. I want these files to be a specific size (i.e. not larger than 500 MB each). Is there a way to enforce that?
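(For reference, a minimal sketch of the write described above; the output path is an illustrative placeholder, and a SparkSession plus the dataframe df are assumed to exist. Spark writes one file per output partition, so the resulting file sizes follow the partition sizes.)

df.write.mode("overwrite").format("parquet").save("dbfs:/mnt/output/my_table")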

1 ACCEPTED SOLUTION

dream
Contributor

Let's say you want the average partition size to be 400 MB; then you can do:

(df.repartition(1024 * 1024 // 400)  # ~1 TB expressed in MB (1024 * 1024), divided by a 400 MB target = 2621 partitions
    .write.mode('overwrite')
    .format('parquet')
    .save('path/to/file'))
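If the total size is not known ahead of time, the partition count can be derived from an estimate instead of the hard-coded 1 TB figure. A rough sketch, assuming this runs in a Databricks notebook (so dbutils is available), that the source files sit directly under a single directory, and that input and output compression are comparable; the paths and target size are illustrative:

import math

input_path = "dbfs:/mnt/source/my_table/"   # illustrative source directory
target_mb = 400                             # desired average file size

total_mb = sum(f.size for f in dbutils.fs.ls(input_path)) / (1024 * 1024)
num_partitions = max(1, math.ceil(total_mb / target_mb))

(df.repartition(num_partitions)
    .write.mode('overwrite')
    .format('parquet')
    .save('path/to/file'))

Note that this only controls the average size; repartition spreads rows roughly evenly, so individual files can still land somewhat above or below the target.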

4 REPLIES


Vinay_M_R
Databricks Employee

Hi @erigaud, good day!

Whenever you are saving the data, you could pass the parquet.block.size config as an option:

Example:

# parquet.block.size is specified in bytes, so a 500 MB target is 500 * 1024 * 1024
(spark.read.parquet("dbfs:/delta/delta-path/part-xxxx.snappy.parquet")
    .write.mode("overwrite").option("parquet.block.size", 500 * 1024 * 1024)
    .parquet("/tmp/vinay/parquet/blocksize1"))
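A quick way to sanity-check the output is to list the written files and their sizes (assuming a Databricks notebook, where dbutils is available; the path matches the example above):

for f in dbutils.fs.ls("/tmp/vinay/parquet/blocksize1"):
    if f.name.endswith(".parquet"):
        print(f.name, round(f.size / (1024 * 1024), 1), "MB")

Keep in mind that parquet.block.size sets the Parquet row-group size within a file, so a single large partition can still produce a file bigger than the target, just split into multiple row groups.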

Anonymous
Not applicable

Hi @erigaud 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

Lakshay
Databricks Employee

In addition to the solutions provided above, we can also control this behavior by specifying the maximum number of records per file, if we have a rough estimate of how many records it takes to reach a 500 MB file:

# Illustrative output path; maxRecordsPerFile caps the number of rows per file, not its byte size
df.write.option("maxRecordsPerFile", 1000000).mode("overwrite").parquet("path/to/file")
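If the records-per-file figure is unknown, it can be estimated from an average on-disk row size. A rough sketch, assuming a Databricks notebook (dbutils available), source files sitting directly under one directory, and comparable compression for input and output; the paths and the 500 MB target are illustrative:

target_bytes = 500 * 1024 * 1024
total_input_bytes = sum(f.size for f in dbutils.fs.ls("dbfs:/mnt/source/my_table/"))
bytes_per_row = max(1, total_input_bytes // df.count())
max_records = target_bytes // bytes_per_row

df.write.option("maxRecordsPerFile", max_records).mode("overwrite").parquet("path/to/file")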
