
DLT Performance

Gilg
Contributor II

Hi,

Context:

I have created a Delta Live Tables pipeline in a UC-enabled workspace that is set to Continuous mode.

Within this pipeline,

I have a Bronze table that uses Auto Loader to read JSON files stored in an ADLS Gen2 storage account. We receive about 200 files per minute, and file sizes vary, up to the MB range.

I have Silver tables that read from Bronze, where we use APPLY_CHANGES with SCD Type 2 enabled.

Gold tables are mainly used for aggregations and report-specific outputs.

At first it performed very well, but as the data grows the performance degrades. For the first few million records processed, it only took 5-8 mins to go from Bronze > Silver > Gold. Now it takes 2-3 hrs to finish.

Upon looking at the job stages, I see Scheduler Delay and Executor Computing Time getting longer in the Bronze stage.

I tried setting maxFilesPerTrigger to 200, but the behavior is the same.

Has anyone seen this behavior in DLT, and how can I optimize it?

Cheers,

Gil

 


Kaniz
Community Manager

Hi @Gilg, it’s great that you’ve set up a Delta Live Tables (DLT) pipeline! However, it’s not uncommon to encounter performance degradation as your data grows.

Let’s explore some strategies to optimize your DLT pipeline:

  1. Partitioning and Clustering:

    • Ensure that your Bronze table is properly partitioned and clustered. Partitioning helps organize data into smaller chunks, while clustering arranges data within each partition based on specific columns. Properly chosen partition and cluster keys can significantly improve query performance.
    • For example, if your data has a timestamp column, consider partitioning on a date derived from it. Clustering can be based on columns frequently used in joins or filters (see the first sketch after this list).
  2. Optimize Your Bronze Table:

    • With roughly 200 small JSON files arriving per minute, the Bronze table quickly accumulates many small Delta files. Enable optimized writes and auto compaction (or run OPTIMIZE on a schedule) so downstream reads are not dominated by file-listing and file-open overhead; the first sketch after this list shows the relevant table properties.
  3. Review Your SCD2 Logic:

    • The APPLY_CHANGES operation in your Silver table (SCD2 logic) can be resource-intensive. Make sure your SCD2 logic is efficient and optimized.
    • Consider using incremental updates instead of reprocessing the entire dataset each time (a minimal APPLY CHANGES sketch follows this list).
  4. Monitor and Tune Executors:

    • Investigate the Scheduler Delay and Executor Computing Time in your Bronze stage. These metrics can provide insights into bottlenecks.
    • Adjust the number of executors, memory allocation, and parallelism settings based on your workload and available resources.
  5. Enhanced Auto Scaling:

    • For continuous pipelines, enable DLT’s enhanced autoscaling so the cluster size tracks the streaming backlog instead of staying fixed while the Bronze queue grows.
  6. Pipeline Scheduling:

    • If your pipeline doesn’t require continuous processing, consider running it in triggered mode rather than continuous mode. Triggered pipelines allow better control over execution and cost.
  7. Data Size and Frequency:

    • You mentioned receiving 200 files per minute, with varying sizes. Consider batching or aggregating smaller files to reduce the overhead of processing individual files.
    • Also, evaluate the frequency of data arrival. If possible, adjust the batch size or frequency to balance performance and resource utilization (see cloudFiles.maxBytesPerTrigger in the first sketch after this list).
  8. Monitoring and Profiling:

    • Regularly monitor your pipeline’s performance using DLT’s observability UI. Identify any anomalies or areas for improvement.
    • Profile your queries to understand where the bottlenecks occur. Use Databricks’ query profiling tools to analyze execution plans.
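
To make items 1, 2 and 7 concrete, here is a minimal sketch of a Bronze Auto Loader table with an ingestion-date partition column, auto-compaction table properties, and bounded micro-batches. The table name, path and option values are placeholders and assumptions rather than recommendations (your own event date may be a better partition choice), and in a continuous DLT pipeline the trigger limits act only as hints for batch sizing:

```python
import dlt
from pyspark.sql import functions as F

# Minimal Bronze sketch (placeholder names, path and values).
@dlt.table(
    name="bronze_events",
    partition_cols=["ingest_date"],  # low-cardinality derived date for partitioning
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true",  # keep small files under control
    },
)
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Bound each micro-batch by bytes as well as file count, so many tiny
        # files do not translate into many tiny batches.
        .option("cloudFiles.maxFilesPerTrigger", "500")
        .option("cloudFiles.maxBytesPerTrigger", "1g")
        .load("abfss://<container>@<account>.dfs.core.windows.net/raw/")  # placeholder path
        .withColumn("ingest_date", F.current_date())
    )
```

Partitioning on a derived date rather than a raw timestamp keeps the partition count low while still allowing time-based pruning.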
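
For item 3, here is a minimal APPLY CHANGES sketch with SCD Type 2; the key column event_id and sequencing column event_ts are assumptions and should be replaced with your actual business key and ordering column:

```python
import dlt
from pyspark.sql import functions as F

# Target Silver table that will hold the SCD2 output.
dlt.create_streaming_table("silver_events")

dlt.apply_changes(
    target="silver_events",
    source="bronze_events",
    keys=["event_id"],              # assumed business key
    sequence_by=F.col("event_ts"),  # assumed ordering column
    stored_as_scd_type="2",
    # Drop ingestion-only columns so they don't land in Silver.
    except_column_list=["_rescued_data", "ingest_date"],
)
```

Depending on your runtime version, track_history_column_list / track_history_except_column_list can also restrict which columns open a new SCD2 version, so rows whose only changes are in metadata columns do not generate extra history.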

 

 