S3 write to bucket - best performance tips

720677
New Contributor III

I'm writing large DataFrames as Delta tables to S3 buckets.

df.write \
  .format("delta") \
  .mode("append") \
  .partitionBy(partitionColumns) \
  .option("mergeSchema", "true") \
  .save(target_path)

What are the best tips to improve the performance of this write? Today it takes several minutes to finish writing to S3.

We're using the latest cluster versions with Spark 3.4.0 and Python.

  1. Which Spark config parameters can improve the write? Should I try "spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled"? If so, how?
  2. Should I try other parameters such as "spark.hadoop.fs.s3a.impl.disable.cache"?
  3. The DataFrame is currently partitioned by only one column. Should I partition by more columns to parallelize the write, or will that not have an impact?
  4. What else can I check?

2 REPLIES

Anonymous
Not applicable

@Pablo (Ariel):

There are several ways to improve the performance of writing data to S3 using Spark. Here are some tips and recommendations:

  1. Increase the size of the write buffer: By default, Spark writes data in 1 MB batches. You can increase the size of the write buffer to reduce the number of requests made to S3 and improve performance. You can set the buffer size using the configuration parameter spark.databricks.delta.logFileCommitBufferSize.
  2. Use a closer S3 endpoint: If your S3 bucket is in a different region than your Databricks workspace, pointing the S3A client at the bucket's regional endpoint can improve write performance. You can set the fs.s3a.endpoint configuration parameter to the URL of that endpoint.
  3. Use S3Guard: S3Guard is a feature of Hadoop that provides a consistent view of S3 data even when multiple writers are writing to the same bucket. You can enable S3Guard by setting the fs.s3a.metadatastore.impl configuration parameter to org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore (the default NullMetadataStore leaves S3Guard disabled).
  4. Use instance storage: If your Databricks cluster has instance storage, you can use it to write data to local disk before copying it to S3. This can improve performance by reducing network traffic. You can set the spark.databricks.delta.logStore. configuration parameter to local.
  5. Parallelize the write: Partitioning the DataFrame by more than one column can help parallelize the write and improve performance. However, the number of partitions should not exceed the number of available cores in your cluster. You can set the number of partitions using the repartition or coalesce methods (see the sketch after this list).
  6. Optimize your data: If your data has a lot of small files, you can use the spark.sql.files.maxRecordsPerFile configuration parameter to control the size of the output files.
  7. Optimize your storage: You can optimize the storage format of your data to improve write performance. For example, using a columnar storage format like Parquet can reduce the amount of data that needs to be written to S3.
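
To make points 5 and 6 concrete, here is a minimal PySpark sketch that combines them with the write from the original post (df, partitionColumns and target_path are taken from that snippet). The partition count (200) and the records-per-file cap are placeholder values to tune against your data volume and cluster size, not recommendations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap the number of records per output file
# (spark.sql.files.maxRecordsPerFile counts records, not bytes; 0 = unlimited).
spark.conf.set("spark.sql.files.maxRecordsPerFile", "1000000")

# Repartition on the same columns used in partitionBy so each task writes
# into few Delta partitions; 200 is a placeholder, roughly a small multiple
# of the cluster's total cores.
(df.repartition(200, *partitionColumns)
   .write
   .format("delta")
   .mode("append")
   .partitionBy(partitionColumns)
   .option("mergeSchema", "true")
   .save(target_path))

Whether repartitioning helps depends on how skewed the partition column values are; the goal is usually fewer, larger files per Delta partition rather than many tiny ones.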

Regarding the specific configuration parameters you mentioned:

  • spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled: This parameter is used to enable the magic committer for all S3 buckets. The magic committer can improve write performance by reducing the number of S3 requests made during a write operation. However, this feature is only available for certain file systems and may not be compatible with Delta Lake.
  • spark.hadoop.fs.s3a.impl.disable.cache: This parameter disables the S3A filesystem client cache. Disabling the cache can reduce the amount of memory held by cached S3A clients, but it can also increase the number of requests made to S3. A sketch of how both of these options would be set is below.
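
As a rough sketch only, assuming a Hadoop 3.x S3A setup: settings like these are normally applied when the session (or cluster) starts rather than mid-job, since committer and filesystem options are read when the S3A client is created. The fs.s3a.committer.name key is my assumption about how the magic committer is usually selected; verify it against your runtime's documentation, and check whether the magic committer is supported with Delta before relying on it.

from pyspark.sql import SparkSession

# On Databricks these would typically go into the cluster's Spark config
# (Advanced options) rather than a notebook cell.
spark = (
    SparkSession.builder
    # Enable the "magic" S3A committer for all buckets (key as given in the question).
    .config("spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled", "true")
    # Hadoop 3.x option selecting which S3A committer to use (assumption, see above).
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    # Disable the S3A filesystem instance cache (trade-offs noted above).
    .config("spark.hadoop.fs.s3a.impl.disable.cache", "true")
    .getOrCreate()
)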

Overall, it's recommended to experiment with different configuration parameters and settings to find the best combination for your specific use case.

720677
New Contributor III

Thank you for the answer - I will start checking the changes.

I couldn't find the logFileCommitBufferSize parameter in the Databricks configuration.

Can you give me a link?

What should the value be? For example:

spark.databricks.delta.logFileCommitBufferSize 50mb

or

spark.databricks.delta.logFileCommitBufferSize 50000

Thank you
