I'm writing large DataFrames as Delta tables to S3 buckets.
df.write \
    .format("delta") \
    .mode("append") \
    .partitionBy(partitionColumns) \
    .option("mergeSchema", "true") \
    .save(target_path)
What are the best ways to improve the performance of this write? Today it takes several minutes to finish writing to S3.
We're on the latest cluster versions, with Spark 3.4.0 and Python.
- Which Spark config parameters can improve the write? Should I try "spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled"? If so, how? (See the first sketch after this list.)
- Should I try other parameters like "spark.hadoop.fs.s3a.impl.disable.cache"?
- The DataFrame is only partitioned by one column. Should I partition by more columns to parallelize the write, or will that not have an impact? (See the second sketch after this list.)
- What else can I check?
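In case it's useful, this is roughly how I imagined enabling the magic committer, either in the cluster's Spark config or at session creation. The extra settings (fs.s3a.committer.name plus the PathOutputCommitProtocol / BindingParquetOutputCommitter classes from the spark-hadoop-cloud module) are my guess at what else is needed, and I'm not sure they even apply to Delta writes, which use their own commit protocol:

from pyspark.sql import SparkSession

# My guess at the magic-committer settings (untested; may not matter for Delta)
spark = (
    SparkSession.builder
    .appName("delta-s3-write")
    # Enable the magic committer for all buckets
    .config("spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled", "true")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    # Route Spark's file commits through the Hadoop path output committer
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)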
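For the partitioning question, this is the kind of change I had in mind: checking the current task count, then repartitioning by the table's partition column(s) before the write. The 200 below is just a placeholder task count I made up, not something I've tuned:

# How many tasks does the write currently use?
print(df.rdd.getNumPartitions())

# Shuffle rows by the Delta partition column(s) so each table partition
# is written by fewer, larger tasks (200 is a placeholder, not a recommendation)
df_out = df.repartition(200, *partitionColumns)

df_out.write \
    .format("delta") \
    .mode("append") \
    .partitionBy(partitionColumns) \
    .option("mergeSchema", "true") \
    .save(target_path)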