Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Overwriting a Delta table takes a lot of time

drag7ter
Contributor

I'm simply trying to overwrite data in a Delta table. The table isn't really huge: it has 50 million rows and is about 1.9 GB in size.

To run this code I have used various cluster configurations, starting from a 1-node cluster with 64 GB and 16 vCPUs, and I have also tried 3-5 worker node clusters with 32-64 GB and 8-16 vCPUs per worker.

I also tried repartitioning with anywhere from 8 to 512 partitions, and also without repartitioning at all. Every run takes 5-8 minutes in total, and the time doesn't change when I scale the cluster. For example:

On a 1-node general purpose cluster (64 GB, 16 vCPUs) with repartition(16) it takes 6 min.

On a 1-node general purpose cluster (126 GB, 32 vCPUs) with repartition(32) it takes 5 min 40 sec.

On a 3-5 node general purpose cluster (64 GB, 16 vCPUs per node) with repartition(32-64) it takes 5 min.

As you can see, the time is almost the same, and I don't understand why.

total_df.repartition(32).write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .option("delta.autoOptimize.optimizeWrite", "true") \
    .option("delta.autoOptimize.autoCompact", "true") \
    .saveAsTable(table_name)

Could someone suggest how the write speed could be improved? Or is this a Delta table limitation, so that it isn't possible to make it faster? It seems strange, since the dataset itself isn't that big, only ~2 GB.

3 REPLIES

BigRoux
Databricks Employee

Have you looked in the Spark UI to see exactly where the bottlenecks are? Also, what is the format of the data you are writing (e.g., CSV, Parquet, Delta)?

BigRoux
Databricks Employee

Also, have you tried without the repartition transformation? Have you also tried without the autoOptimize options?
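For illustration only, a minimal sketch of that simplified write, assuming the `total_df` DataFrame and `table_name` variable from the original post: no explicit repartition() and no autoOptimize write options, so Spark's default partitioning and the table's own properties decide the file layout.

# Sketch (not the original poster's code): plain Delta overwrite without
# repartition() and without the autoOptimize write options.
total_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(table_name)

Comparing this run against the original in the Spark UI should show whether the extra shuffle from repartition() or the write options contribute to the 5-8 minute runtime.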

thackman
New Contributor III

1) You might need to cache the dataframe so it isn't recomputed during the write (see the sketch at the end of this reply).

2) What type of cloud storage are you using? We've noticed slow Delta writes as well. We are using Azure standard storage, which is backed by spinning disks and limited to 60 MiB per second. We've been wondering whether our issue is that we're IO bound and whether switching to premium SSD storage would help, but I'm not very familiar with the Spark UI and haven't been able to find a way to prove it. Based on your numbers, 2 GB at 60 MB/s should only take about 30 seconds; if you are only getting 60 Mb/s (megabits), it would be closer to 5 minutes. But Azure Blob seems to scale fairly well across multiple files.
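As a rough illustration of point 1, a minimal sketch assuming the `total_df` and `table_name` names from the original post; cache() followed by count() materializes the DataFrame once, so the upstream transformations aren't recomputed as part of saveAsTable.

# Sketch: cache and materialize the DataFrame before overwriting the table.
total_df.cache()
total_df.count()  # forces the cached materialization

total_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(table_name)

If the write time drops sharply with the cache in place, most of the 5-8 minutes was recomputation of the upstream plan rather than the Delta write itself.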
