
Overwrite to a table taking 12+ hours

mehalrathod
New Contributor II

One of our Databricks notebooks (Python / PySpark) has been running for 12+ hours, specifically on the overwrite command into a table. In the past this notebook, including the overwrite step, completed within 10 minutes. Suddenly the overwrite step takes 12+ hours and eventually times out.
There were no cluster changes or anything else that would make it run this long. I wanted to see if anyone in the community has faced anything similar or is aware of any recent Databricks bugs/issues.

2 REPLIES

mehalrathod
New Contributor II

We use Azure Databricks.

lingareddy_Alva
Honored Contributor II

Hi @mehalrathod 

This sort of performance regression in Databricks (especially for overwrite) is usually caused by one or more of the following:

Common Causes of Overwrite Slowness
1. Delta Table History or File Explosion
- If the target table is a Delta table, check if the number of files/versions has grown significantly.
- Over time, Delta tables accumulate many small files, especially if OPTIMIZE and VACUUM haven't been run regularly.
- Overwrite may trigger file listing, conflict resolution, or transaction log processing.

Check:
DESCRIBE HISTORY your_table_name;

Check the number of versions, and look for a high file count with:
len(spark.read.format("delta").load("/mnt/path/to/table").inputFiles())
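
If it helps, here is a minimal PySpark sketch that pulls the same information into one notebook cell (the table name and path are placeholders):

history_df = spark.sql("DESCRIBE HISTORY your_table_name")
print("versions in history:", history_df.count())
history_df.select("version", "timestamp", "operation", "operationMetrics").show(5, truncate=False)

# inputFiles() returns a Python list in PySpark, so count it with len()
print("current data files:", len(spark.read.format("delta").load("/mnt/path/to/table").inputFiles()))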

2. Partition Overwrite Behavior Change
- If you are overwriting a partitioned Delta table, and the overwrite mode changed from `dynamic` to `static`, it could result in writing all partitions, not just affected ones.

Confirm mode:
spark.conf.get("spark.sql.sources.partitionOverwriteMode")  # should be 'dynamic' for best performance

Set it explicitly:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
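
For example, a minimal sketch of a dynamic partition overwrite (the table name is a placeholder, and this assumes a runtime where dynamic partition overwrite is supported for Delta, i.e. DBR 11.1+ / Delta 2.0+):

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# With 'dynamic', only the partitions present in df are replaced;
# with 'static', the whole table is truncated and rewritten.
df.write.mode("overwrite").format("delta").saveAsTable("your_partitioned_table")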

3. Compaction or OPTIMIZE Running Concurrently
- Check if any OPTIMIZE or ZORDER operations are running on the table in parallel (scheduled or manual).
- These operations can lock or block writes and make overwrites crawl or fail.
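
A quick way to check is to filter the table history for recent maintenance operations, roughly like this (the table name is a placeholder; the exact operation labels can vary by Delta version):

history_df = spark.sql("DESCRIBE HISTORY your_table_name")
(history_df
    .filter("operation IN ('OPTIMIZE', 'VACUUM START', 'VACUUM END')")
    .select("version", "timestamp", "operation")
    .show(truncate=False))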

4. Concurrency or Lock Contention
Delta Lake uses optimistic concurrency control; if another process keeps committing to the table, your overwrite may spend a long time retrying or waiting. Look in the driver logs for conflict errors such as:

ConcurrentAppendException
TransactionConflictException

Also check the _delta_log folder size and metadata load time.
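
A rough way to gauge _delta_log size from a notebook, assuming the table's storage path is known and dbutils is available:

log_files = dbutils.fs.ls("/mnt/path/to/table/_delta_log")
print("log files:", len(log_files))
print("log size (MB):", sum(f.size for f in log_files) / (1024 * 1024))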

5. Table Metadata or Schema Drift
Large schema evolution, column reorderings, or misalignment between the DF schema and table schema can cause Spark to do heavy metadata planning and validation, which adds time.

Check if the dataframe schema has recently changed subtly (e.g., column types, order, nullability).
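
A quick sanity check is to compare the DataFrame's schema to the target table's schema (names are placeholders):

table_schema = spark.table("your_table_name").schema
if df.schema != table_schema:
    # Print both to spot type, order, or nullability differences
    print("DF:   ", df.schema.simpleString())
    print("Table:", table_schema.simpleString())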

Quick Diagnostic Tips

1. Analyze the Spark UI and Logs: Look at Stage Details and the Job DAG during the overwrite. Often the issue is in a shuffle or a file-level metadata operation.
2. Reproduce the Write on a Smaller Subset:
- Try .limit(10000) and overwrite to the same table. Does it still take long?
3. Try a Write to a Temp Table (see the sketch after this list):
- Same data, written to a new path/table: is performance OK?
- If yes, the issue is with the target Delta table, not the data or the compute.
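
A minimal sketch for tips 2 and 3 (the scratch table name is a placeholder):

# Tip 3: same data, fresh target -- if this is fast, the problem is the
# original target table (its _delta_log / file layout), not the data or the cluster
df.write.mode("overwrite").format("delta").saveAsTable("scratch.overwrite_test_full")

# Tip 2: small sample against the real target -- only if overwriting production
# data with a sample is acceptable (you can restore an earlier version afterwards)
# df.limit(10000).write.mode("overwrite").format("delta").saveAsTable("your_table_name")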

Workarounds & Fixes:
- Run VACUUM and OPTIMIZE on the table periodically.
- Repartition the DF before write to avoid file explosion:
df.repartition(200).write.mode("overwrite").format("delta").saveAsTable("...")
- Try replacing `.saveAsTable()` with a direct path write if you are using external tables (see the sketch below).
- Upgrade the runtime if you're on an older DBR version; file I/O and the Delta writers are optimized in newer runtimes.
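
A sketch of the direct-path variant mentioned above (the path is a placeholder for the external table's location):

# Overwrite by path instead of by table name; for an external table this
# skips most metastore interaction during the write
df.repartition(200).write.mode("overwrite").format("delta").save("/mnt/path/to/table")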

LR
