Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Long runtimes on simple copying of data

kenmyers-8451
Contributor

Hi, my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that have multiple joins and aggregations. To create a baseline, we have been running tests on a simple select-and-write operation. Even with these tests, though, we are seeing runtimes of 30 minutes just to copy the data from one table to another.

We have tried to dig into the Spark UI, and we notice that the WholeStageCodeGen step has the longest compute time:

kenmyers8451_0-1739400824751.png

I am starting to think that is due to deletion vectors, but I am wondering if anyone has thoughts on why that step takes so long and whether we can do anything to improve the speed. There is also a sort step happening after this that we are also confused about, since both the input and output tables are partitioned by the same key. I am thinking that after it does the filter, it loses the partitioning and has to re-sort things before it writes. Any thoughts on that?

Query example:

INSERT OVERWRITE TABLE path.to.target_table ( 
  SELECT
    col_1,
    col_2,
    ...
    col_20
  FROM 
    source_table
)

Environment:

  • DBR 14.3
  • 1 driver, 8 workers, all E8as_v4
  • source table size: 2.9 billion rows, full size ~560 GB, size read in ~189 GB
  • both the source and target tables are partitioned by a date column of int type in the form YYYYMMDD
  • source and target tables have deletion vectors enabled
2 REPLIES

kenmyers-8451
Contributor

I ran another test this week where I changed my source table to not have deletion vectors, and I no longer believe that step is the limiting factor. Without deletion vectors the compute times broke down as follows:

  • reading in data + WholeStageCodeGen + exchange = 7.9 hours, ~1/3 of wall time
  • AQEShuffleRead + sort + WriteFiles = 22.8 hours, ~2/3 of wall time
    • sort = 1.46 hours
    • no information on compute times of the other steps

So it seems like the latter stages are the bottleneck, possibly the write to Delta itself. Unfortunately I can't seem to find anything else about the runtimes.

kenmyers8451_0-1739892538956.png


mark_ott
Databricks Employee

Your slow Spark runtime and unexpectedly long WholeStageCodeGen compute times are likely tied to a mix of Delta Lake features (especially deletion vectors), Spark’s physical plan, and partition handling. Here’s a detailed breakdown and advice based on your example and environment:

Slow WholeStageCodeGen: Likely Contributors

  • Deletion Vectors: When deletion vectors (DVs) are enabled, Spark must read extra metadata and apply in-memory filters for records marked as deleted. With billions of rows, this bookkeeping can cause noticeable slowdowns during the scan and subsequent code generation steps.

  • CodeGen Time: If DVs create complex filter expressions (e.g., row not in deletion vector X, Y, Z...), the generated Java code for filtering becomes larger and more complex, leading to more time in WholeStageCodeGen. This is especially true if your table accumulates lots of small deletes over time and the deletion vectors are never purged. A quick way to check the table's current state is shown right after this list.
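A minimal check you can run before changing anything, assuming the source is the source_table from the query above (SHOW TBLPROPERTIES and DESCRIBE DETAIL are standard Databricks/Delta commands; this only surfaces information, it is not a fix in itself):

-- confirm deletion vectors are enabled (look for delta.enableDeletionVectors = true)
SHOW TBLPROPERTIES source_table;

-- numFiles and sizeInBytes give a feel for fragmentation; many small files plus DVs
-- usually means the scan is doing a lot of per-file bookkeeping
DESCRIBE DETAIL source_table;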

Sort Step Even With Matching Partitioning

  • Shuffle After Filter: Even if source and target are partitioned by the same column, Spark might not preserve partitioning if:

    • The filter (or projection) disrupts partition alignment.

    • Data has been filtered such that some partitions are skewed or missing.

    • The target table expects a strict sort order for fast data skipping or file compaction in Delta Lake, often enforced via a logical plan that adds a shuffle or explicit sort.

  • Insert Overwrite Behavior: The INSERT OVERWRITE operation may trigger Spark to repartition or sort data, even when the partitions are aligned, to ensure data is written correctly and efficiently for downstream querying or to fit Z-ordering/optimization requirements of the file format or table. Running EXPLAIN on the same statement (sketched below) will show exactly where the Exchange and Sort nodes enter the plan.
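A quick way to inspect that, reusing the query from the original post (column list elided as in the original; EXPLAIN FORMATTED is standard Spark SQL and also works on INSERT statements):

EXPLAIN FORMATTED
INSERT OVERWRITE TABLE path.to.target_table (
  SELECT
    col_1,
    col_2,
    -- ... remaining columns ...
    col_20
  FROM
    source_table
);

Look for Exchange and Sort nodes between the scan and the write; their presence, and the expressions they key on, tells you whether AQE, the write path, or the overwrite itself is forcing the extra shuffle/sort.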

What You Can Try To Improve Performance

  • Optimize Deletion Vector Usage:

    • Purge deletion vectors on the source table if possible. VACUUM alone does not remove them; run REORG TABLE ... APPLY (PURGE) to rewrite the affected files, then VACUUM to clean up the superseded files (see the maintenance sketch after this list). Fewer files carrying DVs means simpler plans and less compute time spent filtering.

    • Consider compacting small files and running Delta OPTIMIZE to reduce fragmentation—lots of small files and stale DVs will slow down reads.

  • Control Partition Mapping:

    • Ensure your Spark job is reading and writing with partition pruning and avoid wide shuffles. For a direct copy with matching partition columns, a REPARTITION by the partition key before writing can help Spark avoid unnecessary sorts/shuffles (see the write-path sketch after this list).

  • Tune Write Path:

    • Use the optimized writes and auto compaction Delta Lake features if available; these write fewer, better-sized files.

    • Avoid expensive transformations: keep the select as direct as possible, and use dynamic partition overwrite (spark.sql.sources.partitionOverwriteMode = dynamic) to minimize the amount of data Spark has to rewrite on each run.

  • Investigate Execution Plan:

    • In Spark UI, expand the relevant stages and look for the physical plan. Check if there are unexpected wide dependencies or shuffles after the filter node.

    • Look for skewed partition keys in your data that might cause one or more workers to lag (i.e., a small range of date values that contain disproportionate records).

  • Cluster Sizing:

    • With 8 e8as_v4 workers, you have a strong base, but if this copy is saturating compute (not I/O), scaling up workers/nodes or optimizing your Spark configuration around shuffle partitions may help. Adjust spark.sql.shuffle.partitions to match your actual parallelism needs.
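Putting the deletion-vector and compaction advice above into concrete commands, here is a minimal maintenance sketch. Table names come from the thread; the retention value is just the default, and everything here is a starting point rather than a prescription:

-- rewrite files that carry deletion vectors so the vectors are materialized and dropped
REORG TABLE source_table APPLY (PURGE);

-- remove the superseded files (168 hours is the default retention; don't shorten it casually)
VACUUM source_table RETAIN 168 HOURS;

-- compact small files on both sides of the copy
OPTIMIZE source_table;
OPTIMIZE path.to.target_table;

-- optional: optimized writes / auto compaction on the target
ALTER TABLE path.to.target_table SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);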
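And a sketch of the write-path settings mentioned above. The partition column name date_col and the shuffle-partition count are placeholders, not values from the thread; pick numbers that match your data and the roughly 64 cores across 8 E8as_v4 workers:

-- only rewrite the partitions actually present in the incoming query
SET spark.sql.sources.partitionOverwriteMode = dynamic;

-- size the shuffle to the cluster instead of the 200-partition default
SET spark.sql.shuffle.partitions = 128;

INSERT OVERWRITE TABLE path.to.target_table (
  SELECT /*+ REPARTITION(date_col) */
    col_1,
    col_2,
    -- ... remaining columns ...
    col_20
  FROM
    source_table
);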

Possible Workarounds

  • For large copy jobs, break up by partition (e.g., process a day at a time) instead of a single giant job. This allows for better parallelism and can sidestep unlucky skew or shuffle bottlenecks.

  • Monitor physical memory and spill-to-disk rates using Spark UI to see if JVM/GC overhead is also part of the lag—this is common with huge CodeGen queries.
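If you go the partition-at-a-time route, REPLACE WHERE on a Delta target is one way to express it in SQL (supported on recent DBR versions, so the DBR 14.3 cluster above should be fine). The column name date_col and the date literal are placeholders:

-- overwrite a single date partition; repeat per partition from a loop or workflow task
INSERT INTO path.to.target_table
REPLACE WHERE date_col = 20240101
SELECT
  col_1,
  col_2,
  -- ... remaining columns ...
  col_20
FROM
  source_table
WHERE
  date_col = 20240101;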


In short: deletion vectors do seriously impact scan and compute time, especially on tables with lots of small deletes and infrequent VACUUMs. The sort you’re seeing is probably due to Spark’s need to realign data for correct partitioned writes. Optimizing your Delta tables, rethinking your partition strategy, and tuning cluster settings should help reduce runtimes.