Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakeflow pipeline (formerly DLT pipeline) performance progressively degrades on a persistent cluster

rcostanza
New Contributor III

I have a small DLT pipeline (under 20 tables, all streaming) running in triggered mode, scheduled every 15 minutes during the workday. For development I've set `pipelines.clusterShutdown.delay` so the cluster stays up between updates and I don't have to start a new one for every run.
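For context, that setting goes in the pipeline's `configuration` map in its settings. A minimal sketch, shown as a Python dict mirroring the JSON; the 60-minute delay value is illustrative, not what I actually use:

```python
# Sketch of the "configuration" block from the pipeline settings JSON,
# written as a Python dict for readability. The 60-minute delay is illustrative.
pipeline_configuration = {
    # Keep the pipeline cluster alive after a triggered update finishes
    # (development convenience), so the next 15-minute run skips cluster startup.
    "pipelines.clusterShutdown.delay": "60m",
}
```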

I've noticed that update runtimes get progressively worse as time goes on, ultimately doubling after only 2h. The runtime keeps increasing even across updates in which none of the tables have new data to process; each table's individual update duration stays low, but the overall update runtime is high. Eventually we have to let the compute shut down and restart to regain performance.

Cluster metrics show nothing out of the ordinary: free memory slowly decreases over time but there's still plenty left, and CPU load stays well below its limit even at peak. There's nothing obviously wrong in the logs either.

I'm assuming that restarting the cluster periodically is expected to some degree, but if this were a continuous pipeline instead, where the cluster stays up until it's manually shut down, wouldn't this issue be even more prominent?

Is there a way to mitigate this without restarting the cluster several times a day?

1 REPLY

jerrygen78
New Contributor III

You're right to be concerned: this sounds like a classic case of memory or resource build-up over time, which can affect long-running jobs even when the metrics look fine on the surface. In triggered DLT (now Lakeflow) pipelines, tasks and streaming state can accumulate in memory on a cluster that never restarts, especially with streaming workloads. For a continuous pipeline, this degradation would likely be worse. While a restart is the simplest fix, you can mitigate it by optimizing stateful operations (such as streaming joins and aggregations), enabling state cleanup (for example, watermarks that bound how long state is retained), and making sure checkpoint locations aren't bloating. Also consider autoscaling compute with auto-shutdown between runs, so state is reset without manual restarts.
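If the slowdown is tied to growing streaming state, one standard lever in Spark Structured Streaming is adding a watermark before stateful operations so old state can be purged. A minimal sketch using the DLT Python API; the `events` source table, the `event_time` column, and the one-hour threshold are hypothetical:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Windowed event counts with bounded aggregation state (sketch)")
def event_counts_bounded_state():
    return (
        dlt.read_stream("events")               # hypothetical upstream streaming table
        .withWatermark("event_time", "1 hour")  # allows Spark to drop aggregation state older than 1 hour
        .groupBy(F.window("event_time", "15 minutes"), "event_type")
        .count()
    )
```

Whether this helps depends on whether the slowdown really comes from state growth rather than driver-side overhead accumulating across updates, so treat it as one thing to try rather than a guaranteed fix.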

 