Hey, we currently have 4 environments spread across separate workspaces, and as of Monday we've begun to see transient failures in our DLT pipeline runs with the following error:
"java.util.concurrent.TimeoutException: Timed out after 60 seconds waiting for Python REPL to start"
These transient errors have even reached our Production environment, which has not been changed in the past few weeks. I believe this is specific to Delta Live Tables pipelines, which had been running without issue for months until 9/9, when these failures started occurring.
We've investigated multiple options, including refactoring our pipelines and working through Spark configs. Some of the changes have made performance 'better', but the transient failures are still occurring. My main question is: has anyone else begun to experience these issues? Or could someone point me towards a resolution?
I can provide whatever relevant info would help diagnose this.
Thank you!