
Suddenly Getting Timeout Errors Across All Environments while waiting for Python REPL to start.

MGeiss
New Contributor III

Hey - we currently have 4 environments spread across separate workspaces, and as of Monday we've begun to see transient failures in our DLT pipeline runs with the following error:
"java.util.concurrent.TimeoutException: Timed out after 60 seconds waiting for Python REPL to start"
These transient errors have even reached our Production environment, which hasn't been changed in the past few weeks. The problem appears specific to Delta Live Tables pipelines, which had been running without issue for months...until 9/9, when these failures started.

We've investigated multiple options, including refactoring our pipelines and working through Spark configs. Some of those changes have made performance 'better', but the transient failures are still occurring. My main question is - has anyone else started seeing these issues as well? Or could someone point me towards a resolution?

I can provide whatever relevant info will help with diagnosis.

Thank you!

 


3 REPLIES

filipniziol
Contributor

Is there any chance you are running the jobs on a shared cluster instead of a job cluster?
Could the timeouts be happening because the clusters are under high load and cannot process all the requests in time?

MGeiss
New Contributor III

We are using job compute that is spun up from an instance pool. We've also tested with single, isolated jobs/pipelines (very simple table joins that produce DLT tables, roughly the shape sketched below) and hit the same issue, so I don't think high load is the cause.
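For reference, the isolated test pipelines are essentially of this shape (a simplified sketch - the table names, columns, and source tables are placeholders, not our actual definitions):

```python
# Simplified sketch of one of the isolated test pipelines: a plain join of two
# source tables materialized as a DLT table. Table and column names are
# placeholders for illustration only.
import dlt


@dlt.table(comment="Orders joined to customers (placeholder example)")
def orders_enriched():
    orders = spark.read.table("example_catalog.sales.orders")        # placeholder source
    customers = spark.read.table("example_catalog.sales.customers")  # placeholder source
    return (
        orders.join(customers, on="customer_id", how="left")
              .select("order_id", "customer_id", "customer_name", "order_total")
    )
```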

MGeiss
New Contributor III
Accepted Solution

For anyone else who may be experiencing this issue - it turned out to be related to serverless compute for notebooks/workflows, which we had enabled for the account but were NOT using for our DLT pipelines. After noticing references to serverless in the logs of our failing pipelines and disabling the feature at the account level, our pipelines are running fine again. Also of note - this is Databricks on Azure.
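In case it helps with diagnosis elsewhere, this is roughly how we spotted the serverless references in a failing pipeline's event log (a minimal sketch - it assumes a pipeline with a configured storage location rather than Unity Catalog, and the path below is a placeholder for that location):

```python
# Minimal sketch: read the DLT event log for a failing pipeline and look for
# messages mentioning serverless. Runs in a Databricks notebook where `spark`
# is predefined; the storage path is a placeholder.
from pyspark.sql import functions as F

storage_location = "dbfs:/pipelines/<pipeline-storage>"  # placeholder

events = spark.read.format("delta").load(f"{storage_location}/system/events")

(events
 .filter(F.lower(F.col("message")).contains("serverless"))
 .select("timestamp", "event_type", "message")
 .orderBy(F.col("timestamp").desc())
 .show(truncate=False))
```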
