
DLT Performance

Gilg
Contributor II

Hi,

Context:

I have created a Delta Live Tables pipeline in a UC-enabled workspace that is set to Continuous mode.

Within this pipeline,

I have a Bronze table that uses Auto Loader to read JSON files stored in an ADLS Gen2 storage account. We receive about 200 files per minute, and file sizes can vary, up to the MB range.
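Roughly, the Bronze table is defined like this (the table name, path, and comment are anonymized placeholders):

```python
import dlt

@dlt.table(name="bronze_events", comment="Raw JSON ingested with Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")   # Auto Loader source
        .option("cloudFiles.format", "json")
        # Placeholder path for our ADLS Gen2 landing zone
        .load("abfss://container@account.dfs.core.windows.net/landing/")
    )
```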

I have Silver tables that read from Bronze, using APPLY_CHANGES with SCD Type 2 enabled.
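The Silver layer follows the usual apply_changes pattern, roughly like this (the key and sequencing columns are placeholders):

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that apply_changes will maintain
dlt.create_streaming_table("silver_events")

dlt.apply_changes(
    target="silver_events",
    source="bronze_events",
    keys=["event_id"],                   # placeholder business key
    sequence_by=col("event_timestamp"),  # placeholder ordering column
    stored_as_scd_type=2,                # keep full history (SCD Type 2)
)
```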

Gold tables are mainly used for aggregations and report-specific outputs.

At first it performed very well, but as the data grows, performance degrades. For the first few million records, it took only 5-8 minutes from Bronze > Silver > Gold. Now it takes 2-3 hours to finish.

Looking at the job stages, I can see Scheduler Delay and Executor Computing Time getting longer in the Bronze stage.

I tried setting maxFilesPerTrigger to 200, but the behavior is the same.
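For reference, this is where the option was set; with Auto Loader the documented form is cloudFiles.maxFilesPerTrigger, and cloudFiles.maxBytesPerTrigger is the size-based alternative (path is a placeholder):

```python
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", "200")    # cap files per micro-batch
    # .option("cloudFiles.maxBytesPerTrigger", "1g")   # size-based alternative
    .load("abfss://container@account.dfs.core.windows.net/landing/")
)
```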

Has anyone seen this behavior in DLT, and how can I optimize it?

Cheers,

Gil

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Gilg, it’s great that you’ve set up a Delta Live Tables (DLT) pipeline! However, it’s not uncommon to encounter performance degradation as your data grows.

Let’s explore some strategies to optimize your DLT pipeline:

  1. Partitioning and Clustering:

    • Ensure that your Bronze table is properly partitioned and clustered. Partitioning helps organize data into smaller chunks, while clustering arranges data within each partition based on specific columns. Properly chosen partition and cluster keys can significantly improve query performance.
    • For example, if your data has a timestamp column, consider using it for partitioning. Clustering can be based on columns frequently used in joins or filters (see the sketch after this list).
  2. Optimize Your Bronze Table:

    • Ingesting ~200 small JSON files per minute tends to leave the Bronze Delta table full of small files, which slows every downstream read. Enable optimized writes and auto compaction (or run OPTIMIZE on a schedule) to keep file sizes healthy; the sketch after this list shows the relevant table properties.
  3. Review Your SCD2 Logic:

    • The APPLY_CHANGES operation in your Silver table (SCD2 logic) can be resource-intensive. Make sure your SCD2 logic is efficient and optimized.
    • Consider using incremental updates instead of reprocessing the entire dataset each time.
  4. Monitor and Tune Executors:

    • Investigate the Scheduler Delay and Executor Computing Time in your Bronze stage. These metrics can provide insights into bottlenecks.
    • Adjust the number of executors, memory allocation, and parallelism settings based on your workload and available resources.
  5. Enhanced Auto Scaling:

    • For continuous pipelines, enable DLT’s enhanced autoscaling so the cluster scales with the streaming backlog rather than staying fixed while input volume grows.
  6. Pipeline Scheduling:

    • If your pipeline doesn’t require continuous processing, consider running it in triggered mode rather than continuous mode. Triggered pipelines allow better control over execution and cost.
  7. Data Size and Frequency:

    • You mentioned receiving 200 files per minute, with varying sizes. Consider batching or aggregating smaller files to reduce the overhead of processing individual files.
    • Also, evaluate the frequency of data arrival. If possible, adjust the batch size or frequency to balance performance and resource utilization.
  8. Monitoring and Profiling:

    • Regularly monitor your pipeline’s performance using DLT’s observability UI. Identify any anomalies or areas for improvement.
    • Profile your queries to understand where the bottlenecks occur. Use Databricks’ query profiling tools to analyze execution plans.
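
As a minimal sketch of points 1 and 2 combined (the table name, partition column, and property values are illustrative; choose keys that match your query patterns):

```python
import dlt
from pyspark.sql.functions import current_timestamp, to_date

@dlt.table(
    name="bronze_events",
    partition_cols=["ingest_date"],  # illustrative: partition by arrival date
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",  # write fewer, larger files
        "delta.autoOptimize.autoCompact": "true",    # compact small files automatically
    },
)
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://container@account.dfs.core.windows.net/landing/")  # placeholder path
        .withColumn("ingest_date", to_date(current_timestamp()))
    )
```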

