Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Overwriting a Delta table takes a lot of time

drag7ter
Contributor

I'm simply trying to overwrite data in a Delta table. The table isn't really huge: it has 50 million rows and is about 1.9 GB in size.

To run this code I have used various cluster configurations, starting from a 1-node cluster with 64 GB and 16 vCPUs, and I have also tried 3-5 worker node clusters with 32-64 GB and 8-16 vCPUs per worker.

I also tried repartitioning with anywhere from 8 to 512 partitions, and also without repartitioning at all. Every run takes 5-8 minutes in total, and the time doesn't change when I scale the cluster. For example:

On a 1-node general purpose cluster (64 GB, 16 vCPUs) with repartition(16) it takes 6 min.

On a 1-node general purpose cluster (126 GB, 32 vCPUs) with repartition(32) it takes 5 min 40 sec.

On a 3-5 node general purpose cluster (64 GB, 16 vCPUs per node) with repartition(32-64) it takes 5 min.

As you can see, the time is almost the same, and I don't understand why.

total_df.repartition(32).write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .option("delta.autoOptimize.optimizeWrite", "true") \
    .option("delta.autoOptimize.autoCompact", "true") \
    .saveAsTable(table_name)

Could someone suggest how the write speed could be improved? Or is this a Delta table limitation, so that it isn't possible to make it faster? It seems strange, since the dataset itself isn't that big, only ~2 GB.

3 REPLIES

BigRoux
Databricks Employee

Have you looked in the Spark UI to see exactly where the bottlenecks are? Also, what is the format of the data you are writing (e.g., CSV, Parquet, Delta)?

BigRoux
Databricks Employee

Also, have you tried without the repartition transformation? Have you also tried without the autoOptimize options?
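For illustration only, a minimal sketch of that simplified write, assuming the `total_df` DataFrame and `table_name` variable from the original post: no explicit repartition() and no autoOptimize write options, so Spark's default partitioning and the table's own properties decide the file layout.

# Sketch (not the original poster's code): plain Delta overwrite without
# repartition() and without the autoOptimize write options.
total_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(table_name)

Comparing this run against the original in the Spark UI should show whether the extra shuffle from repartition() or the write options contribute to the 5-8 minute runtime.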

thackman
New Contributor III

1) You might need to cache the dataframe so it isn't recomputed during the write (see the sketch at the end of this reply).

2) What type of cloud storage are you using? We've noticed slow Delta writes as well. We are using Azure standard storage, which is backed by spinning disks and limited to 60 MiB per second. We've been wondering whether our issue is that we're IO bound and whether switching to premium SSD storage would help, but I'm not very familiar with the Spark UI and haven't been able to find a way to prove it. Based on your numbers, 2 GB at 60 MB/s should only take about 30 seconds; if you are only getting 60 Mb/s (megabits), it would be closer to 5 minutes. But Azure Blob seems to scale fairly well across multiple files.
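As a rough illustration of point 1, a minimal sketch assuming the `total_df` and `table_name` names from the original post; cache() followed by count() materializes the DataFrame once, so the upstream transformations aren't recomputed as part of saveAsTable.

# Sketch: cache and materialize the DataFrame before overwriting the table.
total_df.cache()
total_df.count()  # forces the cached materialization

total_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(table_name)

If the write time drops sharply with the cache in place, most of the 5-8 minutes was recomputation of the upstream plan rather than the Delta write itself.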
