Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Problems with cluster shutdown in DLT

LucasAntoniolli
New Contributor

[Issue] DLT finishes processing, but cluster remains active due to log write error

Hi everyone, I'm running into a problem with my DLT pipeline and was hoping someone here could help or has experienced something similar.

Problem Description

The pipeline completes data processing successfully, but the cluster stays active for a long time, even though no data is being processed anymore.

After checking the Driver Logs, I noticed that the system keeps trying to write execution logs and cluster information, but encounters an error each time. As a result, it retries every minute and ends up stuck in this loop.

Error Snippet 

25/09/12 11:13:57 ERROR NativeADLGen2RequestComparisonHandler: Error in request comparison
java.lang.NumberFormatException: For input string: "Fri, 12 Sep 2025 11:13:58 GMT"
    at java.base/java.lang.Long.parseLong(Long.java:711)
    ...
    at com.databricks.sql.io.NativeADLGen2RequestComparisonHandler.doHandle(NativeADLGen2RequestComparisonHandler.scala:94)

It seems that when DLT tries to write to its own event log, it first attempts to read the current log state (e.g., Loading version 306944). The bug appears during this read operation, where it throws a NumberFormatException when parsing a timestamp.
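For illustration only, here is a minimal Scala sketch of that failure mode (not the Databricks handler itself; the header value is taken from the exception above). An RFC 1123 HTTP date cannot be fed to Long.parseLong and has to go through a date parser instead:

import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter

object HttpDateParseDemo {
  def main(args: Array[String]): Unit = {
    val header = "Fri, 12 Sep 2025 11:13:58 GMT"

    // Treating the header as an epoch number reproduces the error in the log.
    try java.lang.Long.parseLong(header)
    catch {
      case e: NumberFormatException =>
        println(s"NumberFormatException: ${e.getMessage}")
    }

    // An RFC 1123 HTTP date has to go through a date parser.
    val parsed = ZonedDateTime.parse(header, DateTimeFormatter.RFC_1123_DATE_TIME)
    println(s"Epoch millis: ${parsed.toInstant.toEpochMilli}")
  }
}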

Observations

  • The error does not crash the pipeline, but it seems to trigger a retry mechanism.

  • This leads to a loop: it tries to read → fails → waits → tries again, keeping the cluster alive unnecessarily.

Question

Has anyone else faced this issue? Any idea how to work around it or resolve it?

Thanks in advance!

1 ACCEPTED SOLUTION

nayan_wylde
Honored Contributor II

Here are some quick workarounds that you can try:

1. Development mode keeps a cluster warm for rapid iteration; production mode stops the cluster right after the run finishes. If you must stay in development mode, tune pipelines.clusterShutdown.delay so the cluster doesn't linger (see the settings sketch after this list). Switching the mode is also a cost saving.

2. In the driver logs, you'll see the NumberFormatException repeating roughly every minute even after the pipeline reports "completed". That's the smoking gun. If you're on a recent DBR (e.g., 15.x/16.x), try pinning the pipeline to DBR 14.3 LTS or, conversely, to the latest LTS to see if the ADLS client code path differs.
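
A minimal pipeline-settings sketch for point 1 (the pipeline name is a placeholder and 60s is just an example value for the delay setting named above):

{
  "name": "my_dlt_pipeline",
  "development": true,
  "channel": "CURRENT",
  "configuration": {
    "pipelines.clusterShutdown.delay": "60s"
  }
}

Setting development to false switches the pipeline to production mode, which stops the cluster as soon as the run finishes.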


4 REPLIES


LucasAntoniolli
New Contributor

I tried rolling the runtime back to 15.4 LTS, where the problem did not occur (the current version is 16.4), but the pipeline refuses to pin an earlier LTS. I tried forcing it through cluster policies, but the pipeline automatically pulls the latest version anyway. The pipeline's channel option only offers two choices, Current and Preview, so the LTS version I set in the policy is ignored. I also tried putting the LEGACY runtime in the JSON, but DLT no longer accepts that parameter.
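
For reference, a cluster-policy rule of the kind described above would look roughly like this (the exact runtime string is an assumption); as noted, the pipeline picks its runtime from the release channel, so this pin is ignored:

{
  "spark_version": {
    "type": "fixed",
    "value": "15.4.x-scala2.12"
  }
}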

nayan_wylde
Honored Contributor II

Can you please try one more option? If you're on Preview, move to Current (or vice versa). Sometimes the regression only exists in one channel.
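
In the pipeline settings JSON this is a one-field change; CURRENT and PREVIEW are the two values the UI exposes (sketch):

{
  "channel": "PREVIEW"
}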

LucasAntoniolli
New Contributor

You won't believe it, my friend. I had already tried everything yesterday with no luck and couldn't find the problem, until I read your answer mentioning production and development. That made me go to the toggle, switch the DLT into development mode, and then straight back to production. To my surprise, the cluster shutdown problem stopped. The funny thing is that the pipeline was already in production mode, and only this DLT had the problem; all the others were working normally. I honestly don't know what happened, but it's solved. Thank you very much for your help and answers.