Hello everyone,
I am running into a performance issue writing 100–500 million rows (partitioned by a column) into a newly created Delta table. The cluster has 256 GB of memory and 64 cores, yet the following code takes a considerable amount of time even for around 70 million rows:
df.repartition(num_partitions * 4, partition_col).write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy(partition_col) \
    .option("mergeSchema", "true") \
    .option("optimizeBucket", "true") \
    .option("maxRecordsPerFile", "1000000") \
    .option("autoOptimize.optimizeWrite", "true") \
    .option("autoOptimize.autoCompact", "true") \
    .option("autoOptimize.autoRepartition", "true") \
    .saveAsTable(f"{bronze_layer}.{table_name}")
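For reference, a stripped-down version of the same write (using the same df, num_partitions, partition_col, bronze_layer, and table_name as above) would look like this; I am not even sure the extra options above are recognized by the writer rather than silently ignored:
# Bare-bones variant for comparison: same repartitioning and partition layout,
# but without the additional writer options.
(df.repartition(num_partitions * 4, partition_col)
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy(partition_col)
    .saveAsTable(f"{bronze_layer}.{table_name}"))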
What do I need to adjust to speed up this write step?