topic Re: DLT Performance Issue in Data Engineering

DLT Performance Issue

michelleliu — Tue, 17 Jun 2025 17:53:06 GMT

I've been seeing patterns in DLT process time in all my pipelines, as in attached screenshot. Each data point is an "update" that's set to "continuous". The process time keeps increasing until a point and drops back to what it's desired to be. This was not by any manual change and I don't see any correlation to hour of the day, or day of the week etc, and it doesn't look like due to data feed either. Has anyone seen this and could help me to understand why it's like this and how to optimize?

Thanks!

Re: DLT Performance Issue

lingareddy_Alva — Wed, 18 Jun 2025 00:37:32 GMT

Hi @michelleliu

This sawtooth pattern in DLT processing times is actually quite common and typically indicates one of several underlying issues. Here are the most likely causes and solutions:

Common Causes
1. Memory Pressure & Garbage Collection

Processing times increase as memory fills up with cached data, shuffle files, or intermediate results
Eventually triggers major garbage collection or memory cleanup, causing the "drop" back to baseline
More common with streaming workloads that accumulate state over time

2. Checkpoint Growth

Streaming checkpoints grow over time, making recovery operations slower
Periodic checkpoint cleanup causes the reset to faster times

3. Auto-scaling Behavior

Cluster starts with optimal resources, gradually loses executors due to perceived idle time
Eventually scales back up when performance degrades enough
The "drop" represents fresh executors joining

4. State Store Compaction

Stateful streaming operations accumulate state files
Periodic compaction/cleanup resets performance

Re: DLT Performance Issue

michelleliu — Wed, 18 Jun 2025 18:25:20 GMT

Thank you @lingareddy_Alva! This is very insightful! I want to start from cleaning up memory and GC. I did a quick google and didn't find anything very solid. Should it be some job set up to run on the pipeline ID? Do you have any reference?

Re: DLT Performance Issue

lingareddy_Alva — Wed, 18 Jun 2025 23:23:57 GMT

Hi @michelleliu

For DLT pipeline monitoring, you have several options depending on what you want to achieve:

Built-in DLT Monitoring: Try this

# Add this table to your existing DLT pipeline @dlt.table( comment="Pipeline performance metrics", table_properties={"quality": "bronze"} ) def pipeline_performance_metrics(): import time from datetime import datetime def collect_metrics(): executor_infos = spark.sparkContext.statusTracker().getExecutorInfos() return [{ "timestamp": datetime.now(), "pipeline_id": spark.conf.get("spark.databricks.pipelines.pipelineId"), "update_id": spark.conf.get("spark.databricks.pipelines.updateId"), "active_executors": len([e for e in executor_infos if e.executorId != "driver"]), "total_memory_mb": sum([e.maxMemory for e in executor_infos if e.executorId != "driver"]) / 1024 / 1024, "used_memory_mb": sum([e.memoryUsed for e in executor_infos if e.executorId != "driver"]) / 1024 / 1024, "memory_utilization": sum([e.memoryUsed for e in executor_infos if e.executorId != "driver"]) / max(sum([e.maxMemory for e in executor_infos if e.executorId != "driver"]), 1) }] return spark.createDataFrame(collect_metrics())