<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Lakeflow pipeline (formerly DLT pipeline) performance progressively degrades on a persistent cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/126862#M47787</link>
    <description>&lt;P&gt;I have a small DLT pipeline (under 20 tables, all streaming) running in triggered mode, scheduled every 15 minutes during the workday. For development I've set `pipelines.clusterShutdown.delay` to avoid starting a new cluster for every update.&lt;/P&gt;&lt;P&gt;I've noticed that update runtimes get progressively worse as time goes on, ultimately doubling after only 2 hours. Runtimes keep increasing even on updates where none of the tables have new data; each table's individual update duration stays low, but the overall runtime is high. Eventually we have to let the compute shut down and restart to regain performance.&lt;/P&gt;&lt;P&gt;Cluster metrics show nothing out of the ordinary: free memory slowly decreases over time but there's still plenty, and CPU load stays well below its limit even at its peak. There's nothing obviously wrong in the logs either.&lt;/P&gt;&lt;P&gt;I'm assuming periodic cluster restarts are expected somehow, but if this were a continuous pipeline instead, where the cluster stays up until manually shut down, wouldn't the issue be even more prominent?&lt;/P&gt;&lt;P&gt;Is there a way to mitigate this without restarting the cluster several times a day?&lt;/P&gt;</description>
    <pubDate>Tue, 29 Jul 2025 20:57:36 GMT</pubDate>
    <dc:creator>rcostanza</dc:creator>
    <dc:date>2025-07-29T20:57:36Z</dc:date>
    <item>
      <title>Lakeflow pipeline (formerly DLT pipeline) performance progressively degrades on a persistent cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/126862#M47787</link>
      <description>&lt;P&gt;I have a small DLT pipeline (under 20 tables, all streaming) running in triggered mode, scheduled every 15 minutes during the workday. For development I've set `pipelines.clusterShutdown.delay` to avoid starting a new cluster for every update.&lt;/P&gt;&lt;P&gt;I've noticed that update runtimes get progressively worse as time goes on, ultimately doubling after only 2 hours. Runtimes keep increasing even on updates where none of the tables have new data; each table's individual update duration stays low, but the overall runtime is high. Eventually we have to let the compute shut down and restart to regain performance.&lt;/P&gt;&lt;P&gt;Cluster metrics show nothing out of the ordinary: free memory slowly decreases over time but there's still plenty, and CPU load stays well below its limit even at its peak. There's nothing obviously wrong in the logs either.&lt;/P&gt;&lt;P&gt;I'm assuming periodic cluster restarts are expected somehow, but if this were a continuous pipeline instead, where the cluster stays up until manually shut down, wouldn't the issue be even more prominent?&lt;/P&gt;&lt;P&gt;Is there a way to mitigate this without restarting the cluster several times a day?&lt;/P&gt;</description>
      <pubDate>Tue, 29 Jul 2025 20:57:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/126862#M47787</guid>
      <dc:creator>rcostanza</dc:creator>
      <dc:date>2025-07-29T20:57:36Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow pipeline (formerly DLT pipeline) performance progressively degrades on a persistent cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/126877#M47793</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;P&gt;You're right to be concerned — this sounds like a classic case of memory or resource leakage over time, which can affect long-running jobs even if metrics look okay on the surface. In triggered DLT (now Lakeflow) pipelines, tasksand state can accumulate in memory, especially with streaming workloads. For continuous pipelines, this degradation would likely be worse. While a restart is the simplest fix, you can mitigate this by optimizing stateful operations (like joins and aggregations), enabling state cleanup settings, and ensuring checkpoint locations aren't bloating. Also consider using autoscaling clusters with auto-shutdown enabled between runs to reset state without manual restarts.&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;Ask Cha&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 30 Jul 2025 04:15:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/126877#M47793</guid>
      <dc:creator>jerrygen78</dc:creator>
      <dc:date>2025-07-30T04:15:42Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow pipeline (formerly DLT pipeline) performance progressively degrades on a persistent cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/146443#M52655</link>
      <description>&lt;P&gt;I'm facing this exact issue, only with a standard job instead of a DLT pipeline. I can't use serverless or restart the cluster periodically due to things outside my control. Any specific advice on diagnosing and resolving this? I don't think it can be checkpoint bloat, since a cluster restart resolves the issue for a while.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Feb 2026 07:09:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-pipeline-formerly-dlt-pipeline-performance/m-p/146443#M52655</guid>
      <dc:creator>JargerBiirli</dc:creator>
      <dc:date>2026-02-02T07:09:39Z</dc:date>
    </item>
  </channel>
</rss>

