Databricks Community

MGeiss · ‎09-10-2024

Hey - we currently have 4 environments spread across separate workspaces, and as of Monday we've begun to see transient failures in our DLT pipeline runs with the following error:
"java.util.concurrent.TimeoutException: Timed out after 60 seconds waiting for Python REPL to start"
These transient errors have even reached our Production environment, which has not been changed in the past few weeks. As far as I can tell, this is specific to Delta Live Tables pipelines, which had been running without issue for months until 9/9, when these failures started.

We've investigated multiple options, including refactoring our pipelines and working through Spark configs. Some of the changes have improved performance, but the transient failures are still occurring. My main question is - has anyone else started seeing these failures as well? Or could someone point me towards a resolution?

I can provide whatever relevant info would help diagnose this.

Thank you!

MGeiss · ‎09-12-2024

For anyone else who may be experiencing this issue - it seems to have been related to serverless compute for notebooks/workflows, which we had enabled for the account but WERE NOT using for our DLT pipelines. After noticing references to serverless in the logs of our failing pipelines, we disabled it at the account level, and our pipelines are running fine again. Also of note - this is Databricks on Azure.
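In case it helps anyone checking their own logs, here is a minimal sketch of the kind of scan described above. It assumes you have already exported the failing pipeline's log to plain text. The function name `scan_log` and the sample lines are made up for illustration and are not real Databricks output; only the timeout message is taken from the error quoted in this thread.

```python
import re

# Fragment of the error message reported in this thread.
TIMEOUT_MARKER = "waiting for Python REPL to start"


def scan_log(text):
    """Return (timeout_hits, serverless_hits) as lists of (line_number, line)."""
    timeout_hits, serverless_hits = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if TIMEOUT_MARKER in line:
            timeout_hits.append((number, line.strip()))
        # Case-insensitive match on any mention of serverless.
        if re.search(r"serverless", line, re.IGNORECASE):
            serverless_hits.append((number, line.strip()))
    return timeout_hits, serverless_hits


if __name__ == "__main__":
    # Hypothetical placeholder log text, NOT actual pipeline output.
    sample = "\n".join([
        "INFO starting update",
        "WARN Serverless setting referenced (placeholder line)",
        "ERROR java.util.concurrent.TimeoutException: Timed out after 60 seconds "
        "waiting for Python REPL to start",
    ])
    timeouts, serverless = scan_log(sample)
    print("timeout lines:", [n for n, _ in timeouts])
    print("serverless lines:", [n for n, _ in serverless])
```

If both lists come back non-empty for a pipeline that is not supposed to use serverless, that may be worth raising with whoever manages the account-level setting.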

filipniziol · ‎09-10-2024

Is there any chance you are running jobs on a shared cluster instead of a job cluster?
Could the timeouts be caused by the clusters being under high load and unable to process all the requests in time?

MGeiss · ‎09-10-2024

We are using job compute that is spun up as part of a pool. We've also tested with single jobs/pipelines (very simple table joins that result in DLT tables) and experienced the same issue, so I don't think the issue is high load.