I'm simply trying to overwrite the data in a Delta table. The table is not really huge: it has 50 million rows and is 1.9 GB in size.
To run this code I have tried various cluster configurations, starting from a single-node cluster (64 GB, 16 vCPUs), and I also tried clusters with 3-5 worker nodes, each worker with 32-64 GB and 8-16 vCPUs.
I also tried repartitioning to anywhere from 8 to 512 partitions, as well as not repartitioning at all. But on every run the total time is 5-8 minutes, and it doesn't depend on how far I scale the cluster. E.g.:
On 1 node, general purpose, 64 GB, 16 vCPUs, with repartition(16): 6 min.
On 1 node, general purpose, 126 GB, 32 vCPUs, with repartition(32): 5 min 40 sec.
On 3-5 nodes, general purpose, 64 GB and 16 vCPUs per node, with repartition(32-64): 5 min.
As you can see, the time is almost the same in every case, and I don't understand why.
    total_df.repartition(32).write \
        .format("delta") \
        .mode("overwrite") \
        .option("overwriteSchema", "true") \
        .option("delta.autoOptimize.optimizeWrite", "true") \
        .option("delta.autoOptimize.autoCompact", "true") \
        .saveAsTable(table_name)
Could someone suggest how the write speed could be improved? Or is this a limitation of Delta tables, so that the speed cannot be increased? It seems strange, as the dataset itself is not that large, only ~2 GB.
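Since the total time barely changes with cluster size, one way to narrow this down is to time the computation of the DataFrame separately from the Delta write. The sketch below is only an illustration under assumptions: the helper names `timed` and `compare_compute_vs_write` are hypothetical, `df` stands for `total_df`, and the first measurement relies on Spark's built-in `noop` data source (Spark 3.0+), which evaluates the full DataFrame but discards the rows. The Delta-specific auto-optimize options are left out to keep the comparison simple.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def timed(label: str, action: Callable[[], T]) -> Tuple[T, float]:
    """Run `action` once, print the wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return result, elapsed


def compare_compute_vs_write(df, table_name: str) -> Dict[str, float]:
    """Hypothetical helper: time a no-op write against the real Delta write.

    `df` is assumed to be a Spark DataFrame (e.g. total_df from the question).
    """
    # The "noop" sink forces every upstream stage to run but writes nothing,
    # so this number approximates the cost of producing the rows.
    _, compute_s = timed(
        "compute only (noop sink)",
        lambda: df.write.format("noop").mode("overwrite").save(),
    )
    # The same DataFrame written to Delta. Unless df is cached, this
    # recomputes the rows, so the figure is compute time plus write time.
    _, write_s = timed(
        "compute + delta write",
        lambda: df.write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(table_name),
    )
    return {"compute_s": compute_s, "write_s": write_s}
```

Calling `compare_compute_vs_write(total_df, table_name)` on the cluster prints both timings. If the two numbers are close, most of the 5-8 minutes is probably going into building `total_df` (reads, joins, shuffles) rather than into the Delta write itself, which would fit with repartitioning and extra workers making little difference.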