Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT Performance Issue

michelleliu
New Contributor III

I've been seeing a recurring pattern in DLT processing time across all my pipelines, as in the attached screenshot. Each data point is an "update" of a pipeline set to "continuous" mode. The processing time keeps increasing up to a point and then drops back to the desired level. This isn't caused by any manual change, I don't see any correlation with hour of day or day of week, and it doesn't look like it's driven by the data feed either. Has anyone seen this, and could you help me understand why it happens and how to optimize?

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

lingareddy_Alva
Honored Contributor II

Hi @michelleliu 


This sawtooth pattern in DLT processing times is actually quite common and typically indicates one of several underlying issues. Here are the most likely causes and solutions:

Common Causes
1. Memory Pressure & Garbage Collection

Processing times increase as memory fills up with cached data, shuffle files, or intermediate results
Eventually triggers major garbage collection or memory cleanup, causing the "drop" back to baseline
More common with streaming workloads that accumulate state over time (see the watermark sketch after this list)


2. Checkpoint Growth

Streaming checkpoints grow over time, making recovery operations slower
Periodic checkpoint cleanup causes the reset to faster times


3. Auto-scaling Behavior

Cluster starts with optimal resources, gradually loses executors due to perceived idle time
Eventually scales back up when performance degrades enough
The "drop" represents fresh executors joining


4. State Store Compaction

Stateful streaming operations accumulate state files
Periodic compaction/cleanup resets performance
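
If the pattern traces back to unbounded streaming state (causes 1 and 4), the usual mitigation is to put a watermark on the event-time column of any stateful aggregation so old state can be purged instead of accumulating. Below is a minimal sketch, not your pipeline's actual code: the table name events_bronze and the columns event_time / event_type are placeholders.

# Minimal sketch: bound streaming state with a watermark so it cannot grow
# indefinitely. "events_bronze", "event_time", and "event_type" are
# placeholder names for illustration.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Hourly counts with bounded streaming state")
def hourly_event_counts():
    return (
        dlt.read_stream("events_bronze")            # hypothetical upstream streaming table
        .withWatermark("event_time", "2 hours")     # state older than ~2 hours becomes eligible for cleanup
        .groupBy(F.window("event_time", "1 hour"), "event_type")
        .count()
    )

For the autoscaling behavior in cause 3, check the pipeline's cluster settings: enhanced autoscaling with a reasonable min/max worker range usually smooths out the scale-down/scale-up cycle.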

 

LR


3 REPLIES


michelleliu
New Contributor III

Thank you @lingareddy_Alva! This is very insightful! I want to start with cleaning up memory and GC. I did a quick Google search and didn't find anything very solid. Should it be a job set up to run against the pipeline ID? Do you have any reference?

lingareddy_Alva
Honored Contributor II

Hi @michelleliu

For DLT pipeline monitoring, you have several options depending on what you want to achieve:

Built-in DLT Monitoring: Try this

# Add this table to your existing DLT pipeline.
# Note: PySpark's StatusTracker doesn't expose executor details, so this
# reaches the JVM SparkStatusTracker through py4j. It needs a cluster where
# the SparkContext is accessible (e.g. not serverless).
import dlt
from datetime import datetime

@dlt.table(
    comment="Pipeline performance metrics",
    table_properties={"quality": "bronze"}
)
def pipeline_performance_metrics():

    def collect_metrics():
        # Executor info from the JVM status tracker (the driver has an entry too).
        executor_infos = list(
            spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos()
        )

        # The Java API exposes on-heap storage memory per executor.
        total_mem = sum(e.totalOnHeapStorageMemory() for e in executor_infos)
        used_mem = sum(e.usedOnHeapStorageMemory() for e in executor_infos)

        return [{
            "timestamp": datetime.now(),
            # Conf keys as in the original suggestion; default to "unknown" if unset.
            "pipeline_id": spark.conf.get("spark.databricks.pipelines.pipelineId", "unknown"),
            "update_id": spark.conf.get("spark.databricks.pipelines.updateId", "unknown"),
            "executor_count": len(executor_infos),
            "total_storage_memory_mb": total_mem / 1024 / 1024,
            "used_storage_memory_mb": used_mem / 1024 / 1024,
            "storage_memory_utilization": used_mem / max(total_mem, 1)
        }]

    return spark.createDataFrame(collect_metrics())
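
Beyond custom metrics, the DLT event log already records cluster_resources and flow_progress events that line up with the sawtooth you're seeing (executor count, task-slot utilization, queued tasks). Here's a rough sketch of querying it, assuming your pipeline has a storage location configured so the event log sits under <storage-location>/system/events (for Unity Catalog pipelines, read it via the event_log() table-valued function instead). The field names follow the documented cluster_resources schema, so verify them against your own event log:

# Rough sketch: pull autoscaling/utilization metrics from the DLT event log.
# Replace the placeholder path with your pipeline's storage location.
from pyspark.sql import functions as F

event_log = spark.read.format("delta").load("<pipeline-storage-location>/system/events")

(event_log
    .where("event_type = 'cluster_resources'")
    .select(
        "timestamp",
        F.expr("details:cluster_resources.num_executors").alias("num_executors"),
        F.expr("details:cluster_resources.avg_task_slot_utilization").alias("avg_task_slot_utilization"),
        F.expr("details:cluster_resources.avg_num_queued_tasks").alias("avg_num_queued_tasks"),
    )
    .orderBy("timestamp")
    .show(truncate=False))

Plotting those executor and utilization numbers against your update durations should tell you quickly whether the sawtooth is driven by autoscaling or by memory/state growth.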

 

LR