Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Long runtimes on simple copying of data

kenmyers-8451
New Contributor II

Hi, my team has been trying to identify areas where we can improve our processes. We have some long runtimes on processes that involve multiple joins and aggregations, so to establish a baseline we have been running tests on a simple select-and-write operation. Even with these tests, though, we are seeing runtimes of 30 minutes just to copy the data from one table to another.

We have tried to dig into the Spark UI, and we noticed that the WholeStageCodegen step has the longest compute time.

[screenshot: Spark UI query plan showing WholeStageCodegen compute time]

I am starting to think this is due to deletion vectors, but I am wondering if anyone has thoughts on why that step takes so long and whether we can do anything to improve the speed. There is also a sort step that happens after this, which also confuses us, since both the input and output tables are partitioned by the same key. My theory is that after the filter, Spark loses the partitioning information and has to re-sort the data before it writes. Any thoughts on that?
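
For reference, the test we are considering to confirm or rule out deletion vectors (a sketch using the documented REORG command; whether it actually changes anything here is an assumption on our part):

REORG TABLE path.to.source_table APPLY (PURGE);

-- optionally stop new deletion vectors from being created
ALTER TABLE path.to.source_table
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false');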

Query example:

INSERT OVERWRITE TABLE path.to.target_table ( 
  SELECT
    col_1,
    col_2,
    ...
    col_20
  FROM 
    source_table
)
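
(As an aside: we know that if all we needed was a one-to-one copy, something like a deep clone would copy the underlying files without going through the query engine at all, but we want this baseline to reflect real query-plus-write behavior. Sketch for completeness:)

CREATE OR REPLACE TABLE path.to.target_table
  DEEP CLONE path.to.source_table;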

Environment:

  • DBR 14.3
  • 1 driver, 8 workers, all E8as_v4
  • source table size: 2.9 billion rows, full size ~560 GB, size read in ~189 GB
  • both the source and target tables are partitioned by a date column of int type in the form YYYYMMDD
  • source and target tables have deletion vectors enabled (property check sketched below)
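
To rule out table settings that add work before the write (our assumption being that optimized writes, if enabled, insert an extra exchange), the table properties can be checked with something like:

SHOW TBLPROPERTIES path.to.target_table;
-- keys we would look for (assumption): delta.autoOptimize.optimizedWrite, delta.enableDeletionVectors
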
1 REPLY

kenmyers-8451
New Contributor II

I ran another test this week where I changed my source table to not have deletion vectors, and I no longer believe that step is the limiting factor. Without deletion vectors, the compute times broke down as follows:

  • reading in data + WholeStageCodegen + exchange = 7.9 hours, ~1/3 of wall time
  • AQEShuffleRead + sort + WriteFiles = 22.8 hours, ~2/3 of wall time
    • sort = 1.46 hours
    • no information on compute times of the other steps

So it seems like the latter stages are the bottleneck, possibly the write-to-Delta step. Unfortunately I can't seem to find anything else about where those runtimes go.
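
The next test I am considering (a sketch; it assumes the pre-write exchange comes from Databricks' optimized writes, which is a guess on my part) is to disable that setting for the session and rerun the copy:

SET spark.databricks.delta.optimizeWrite.enabled = false;
-- then rerun the same INSERT OVERWRITE from the original post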

[screenshot: Spark UI stage metrics from the second test]

 
