Data Engineering

DLT Performance

Gilg
Contributor II

Hi,

Context:

I have created a Delta Live Tables (DLT) pipeline in a Unity Catalog (UC) enabled workspace, set to run in Continuous mode.

Within this pipeline,

The Bronze layer uses Auto Loader to read JSON files from an ADLS Gen2 storage account. We receive about 200 files per minute, and file sizes vary, up to the MB range.

The Silver tables read from Bronze using APPLY_CHANGES with SCD2 enabled.
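
A minimal sketch of what such a Silver step typically looks like in the DLT Python API, to make the setup concrete. The table names, the key column `id`, and the sequencing column `event_ts` are placeholders, not taken from the actual pipeline. The `dlt` module is passed in as a parameter so the function can be read and exercised outside a pipeline; inside a pipeline it would be the module from `import dlt`.

```python
def define_silver(dlt, source="bronze_events", target="silver_events",
                  keys=("id",), sequence_col="event_ts"):
    """Declare an SCD type 2 target that is fed from a Bronze streaming table.

    All names and columns here are hypothetical placeholders.
    """
    # The target of apply_changes must be declared as a streaming table first.
    dlt.create_streaming_table(name=target)

    # stored_as_scd_type=2 keeps history rows instead of updating in place.
    dlt.apply_changes(
        target=target,
        source=source,
        keys=list(keys),
        sequence_by=sequence_col,
        stored_as_scd_type=2,
    )
```

In a pipeline notebook this would be called as `define_silver(dlt)` after `import dlt`. With SCD type 2, each change is merged against the existing history in the target, so the choice of keys and sequencing column matters as the table grows.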

The Gold tables are mainly used for aggregations and report-specific datasets.

At first it performed very well, but performance has degraded as the data has grown. For the first few million records it processed, Bronze > Silver > Gold took only 5-8 minutes end to end. Now it takes 2-3 hours to finish.
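
To put the slowdown in terms of the arrival rate mentioned above (about 200 files per minute), a quick back-of-the-envelope calculation shows how many new files land while one end-to-end run is in progress. It assumes the arrival rate stays constant, which is an assumption and not something measured here.

```python
ARRIVAL_RATE_PER_MIN = 200  # files per minute, as stated in the post


def files_arrived_during(run_minutes, rate_per_min=ARRIVAL_RATE_PER_MIN):
    """Number of new files that arrive while a run of the given length executes."""
    return run_minutes * rate_per_min


# End-to-end durations quoted in the post, converted to minutes.
for label, minutes in [("5 min", 5), ("8 min", 8), ("2 h", 120), ("3 h", 180)]:
    print(f"{label}: {files_arrived_during(minutes)} files arrive during the run")
# 5 min: 1000 files arrive during the run
# 8 min: 1600 files arrive during the run
# 2 h: 24000 files arrive during the run
# 3 h: 36000 files arrive during the run
```

Under that assumption, a 2-3 hour run has 24,000-36,000 new files waiting by the time it finishes, compared with 1,000-1,600 when a run took 5-8 minutes.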

Looking at the job stages, I see Scheduler Delay and Executor Computing Time getting longer in Bronze.

I tried setting maxFilePerTrigger to 200, but the behavior is the same.
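
For reference, a minimal sketch of how a per-batch file cap is usually passed to Auto Loader in a DLT Bronze table. Note that the documented Auto Loader option is spelled `cloudFiles.maxFilesPerTrigger` (plural "Files"), so the exact spelling used in the pipeline is worth double-checking. The source path and table name below are hypothetical placeholders, and `dlt` and `spark` are passed in as parameters on the assumption that the pipeline runtime supplies them.

```python
# Hypothetical landing location; replace with the real ADLS Gen2 path.
SOURCE_PATH = "abfss://<container>@<account>.dfs.core.windows.net/landing/"


def autoloader_options(max_files_per_trigger=None):
    """Build Auto Loader reader options for a JSON source."""
    opts = {"cloudFiles.format": "json"}
    if max_files_per_trigger is not None:
        # Caps how many new files are picked up in each micro-batch.
        opts["cloudFiles.maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


def define_bronze(dlt, spark, path=SOURCE_PATH, max_files_per_trigger=None):
    """Register a Bronze streaming table (placeholder name) that reads with Auto Loader."""

    @dlt.table(name="bronze_events", comment="Raw JSON files ingested with Auto Loader")
    def bronze_events():
        return (
            spark.readStream.format("cloudFiles")
            .options(**autoloader_options(max_files_per_trigger))
            .load(path)
        )

    return bronze_events
```

Calling `autoloader_options(200)` yields the two options `cloudFiles.format` and `cloudFiles.maxFilesPerTrigger`. The cap only bounds the size of each micro-batch; it does not by itself change how long file discovery or downstream merges take.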

Has anyone seen this behavior in DLT, and how can it be optimized?

Cheers,

Gil

 

