
DLT Performance

Gilg
Contributor II

Hi,

Context:

I have created a Delta Live Tables pipeline in a UC-enabled workspace that is set to Continuous mode.

Within this pipeline,

I have a Bronze table that uses Auto Loader to read JSON files stored in an ADLS Gen2 storage account. We receive about 200 files per minute, and file sizes vary, up to the MB range.

I have Silver tables that read from Bronze, where we use APPLY_CHANGES with SCD Type 2 enabled.

Gold tables are mainly used for aggregations and report-specific outputs.

At first it performed very well, but as the data grows the performance degrades. For the first few million records processed, it only took 5-8 mins to go from Bronze > Silver > Gold. Now it takes 2-3 hrs to finish.

Upon looking at the job stages, I see Scheduler Delay and Executor Computing Time getting longer in the Bronze stage.

I tried setting maxFilesPerTrigger to 200, but the behavior is the same.

Has anyone seen this behavior in DLT, and how can I optimize it?

Cheers,

Gil

 


Kaniz
Community Manager

Hi @Gilg, it’s great that you’ve set up a Delta Live Tables (DLT) pipeline! However, it’s not uncommon to encounter performance degradation as your data grows.

Let’s explore some strategies to optimize your DLT pipeline:

  1. Partitioning and Clustering:

    • Ensure that your Bronze table is properly partitioned and clustered. Partitioning helps organize data into smaller chunks, while clustering arranges data within each partition based on specific columns. Properly chosen partition and cluster keys can significantly improve query performance.
    • For example, if your data has a timestamp column, consider partitioning on a date derived from it. Clustering can be based on columns frequently used in joins or filters (see the first sketch after this list).
  2. Optimize Your Bronze Table:

    • With roughly 200 small JSON files arriving per minute, the Bronze table quickly accumulates many small Delta files. Enable optimized writes and auto compaction (or run OPTIMIZE on a schedule) so downstream reads are not dominated by file-listing and file-open overhead; the first sketch after this list shows the relevant table properties.
  3. Review Your SCD2 Logic:

    • The APPLY_CHANGES operation in your Silver table (SCD2 logic) can be resource-intensive. Make sure your SCD2 logic is efficient and optimized.
    • Consider using incremental updates instead of reprocessing the entire dataset each time (a minimal APPLY CHANGES sketch follows this list).
  4. Monitor and Tune Executors:

    • Investigate the Scheduler Delay and Executor Computing Time in your Bronze stage. These metrics can provide insights into bottlenecks.
    • Adjust the number of executors, memory allocation, and parallelism settings based on your workload and available resources.
  5. Enhanced Auto Scaling:

    • For continuous pipelines, enable DLT’s enhanced autoscaling so the cluster size tracks the streaming backlog instead of staying fixed while the Bronze queue grows.
  6. Pipeline Scheduling:

    • If your pipeline doesn’t require continuous processing, consider running it in triggered mode rather than continuous mode. Triggered pipelines allow better control over execution and cost.
  7. Data Size and Frequency:

    • You mentioned receiving 200 files per minute, with varying sizes. Consider batching or aggregating smaller files to reduce the overhead of processing individual files.
    • Also, evaluate the frequency of data arrival. If possible, adjust the batch size or frequency to balance performance and resource utilization (see cloudFiles.maxBytesPerTrigger in the first sketch after this list).
  8. Monitoring and Profiling:

    • Regularly monitor your pipeline’s performance using DLT’s observability UI. Identify any anomalies or areas for improvement.
    • Profile your queries to understand where the bottlenecks occur. Use Databricks’ query profiling tools to analyze execution plans.
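
To make items 1, 2 and 7 concrete, here is a minimal sketch of a Bronze Auto Loader table with an ingestion-date partition column, auto-compaction table properties, and bounded micro-batches. The table name, path and option values are placeholders and assumptions rather than recommendations (your own event date may be a better partition choice), and in a continuous DLT pipeline the trigger limits act only as hints for batch sizing:

```python
import dlt
from pyspark.sql import functions as F

# Minimal Bronze sketch (placeholder names, path and values).
@dlt.table(
    name="bronze_events",
    partition_cols=["ingest_date"],  # low-cardinality derived date for partitioning
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true",  # keep small files under control
    },
)
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Bound each micro-batch by bytes as well as file count, so many tiny
        # files do not translate into many tiny batches.
        .option("cloudFiles.maxFilesPerTrigger", "500")
        .option("cloudFiles.maxBytesPerTrigger", "1g")
        .load("abfss://<container>@<account>.dfs.core.windows.net/raw/")  # placeholder path
        .withColumn("ingest_date", F.current_date())
    )
```

Partitioning on a derived date rather than a raw timestamp keeps the partition count low while still allowing time-based pruning.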
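
For item 3, here is a minimal APPLY CHANGES sketch with SCD Type 2; the key column event_id and sequencing column event_ts are assumptions and should be replaced with your actual business key and ordering column:

```python
import dlt
from pyspark.sql import functions as F

# Target Silver table that will hold the SCD2 output.
dlt.create_streaming_table("silver_events")

dlt.apply_changes(
    target="silver_events",
    source="bronze_events",
    keys=["event_id"],              # assumed business key
    sequence_by=F.col("event_ts"),  # assumed ordering column
    stored_as_scd_type="2",
    # Drop ingestion-only columns so they don't land in Silver.
    except_column_list=["_rescued_data", "ingest_date"],
)
```

Depending on your runtime version, track_history_column_list / track_history_except_column_list can also restrict which columns open a new SCD2 version, so rows whose only changes are in metadata columns do not generate extra history.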

 

 