Long runtimes on simple copying of data
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-12-2025 03:04 PM
Hi my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that have multiple joins and aggregations. To create a baseline we have been running tests on a simple select and write operation. Even with these tests though we are seeing runtimes of 30 minutes to copy the data from one table to another.
We have tried to dig into spark UI and we notice that the WholeStageCodeGen has the longest compute time
I am starting to think that that is due to deletion vectors but I am wondering if anyone has any thoughts why that step takes so long and if we can do anything to improve the speed. There is also a sort step that is happening after this which we also are confused about as both the input and output tables are partitioned by the same key. I am thinking that after it does the filter, it loses the partition and has to re-sort things before it writes. Any thoughts on that?
Query example:
INSERT OVERWRITE TABLE path.to.target_table (
SELECT
col_1,
col_2,
...
col_20
FROM
source_table
)
Environment:
- dbr 14.3
- 1 driver, 8 workers all e8as_v4
- source table size: 2.9 billion rows, full size ~560GB, size read in ~189GB
- both the source and target tables are partitioned by a date of int type in the form YYYYMMDD
- source and target tables have deletion vectors enabled
- Labels:
-
Spark
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-18-2025 07:26 AM - edited 02-18-2025 07:29 AM
I ran another test this week where I changed my source table to not have deletion vectors and no longer believe that step is limiting factor. Without it the compute times seemed to be as follows:
- reading in data + wscg + exchange = 7.9 hours, ~1/3 wall time
- AQEShuffleRead+sort+writefiles = 22.8 hours, ~2/3 wall time
- sort = 1.46 hours
- no information on compute times of the other steps
So it seems like the latter stages are the bottleneck, possibly during the writing to delta step. Unfortunately I can't seem to find anything else about the runtimes.

