Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT | Cluster terminated by System-User | INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds

vgupta
New Contributor II

Dear Community, Hope you are doing well.

For the last couple of days I have been seeing a very strange issue with my DLT pipeline: every 60-70 minutes it fails in continuous mode with the error:

INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds.

This occurred at 2023-03-12 21:26:51 IST (please see the screenshot).

[Screenshot: DLT_ERROR]

When I check the events for the driver, it says "Cluster terminated by system-user" (at 2023-03-12 21:26:47 IST), and I could not find any details associated with this event. This happens again and again: each time, the pipeline restarts, runs fine for about 1 hour (sometimes 1.5 hours), and then fails the same way.
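For reference, the error text and the two timestamps quoted above can be compared directly. The following is a minimal Python sketch, not part of the original post: the helper names `parse_driver_lost_error` and `seconds_between` are hypothetical, and the only inputs are the error message and the two timestamps quoted in this post. It extracts the cluster ID and the unreachable interval from the message, then measures the gap between the termination event and the reported error.

```python
import re
from datetime import datetime

# Error text exactly as quoted in the post.
ERROR_MESSAGE = (
    "INTERNAL_ERROR: Communication lost with driver. "
    "Cluster 0312-140502-k9monrjc was not reachable for 120 seconds"
)

# Hypothetical pattern for this one message shape; other DLT errors may differ.
ERROR_PATTERN = re.compile(
    r"Cluster (?P<cluster_id>\S+) was not reachable for (?P<seconds>\d+) seconds"
)


def parse_driver_lost_error(message):
    """Return (cluster_id, unreachable_seconds), or None if the text does not match."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return match.group("cluster_id"), int(match.group("seconds"))


def seconds_between(earlier, later, fmt="%Y-%m-%d %H:%M:%S"):
    """Seconds elapsed between two timestamps given in the same time zone."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


cluster_id, unreachable_seconds = parse_driver_lost_error(ERROR_MESSAGE)

# Termination event (21:26:47 IST) versus pipeline error (21:26:51 IST).
gap = seconds_between("2023-03-12 21:26:47", "2023-03-12 21:26:51")

print(cluster_id, unreachable_seconds, gap)
# prints: 0312-140502-k9monrjc 120 4.0
```

The 4-second gap suggests the pipeline error was reported right after the termination event, although on its own it does not show what triggered the termination.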

[Screenshot: DLT_Cluster_events]

Could anyone please help us, as a priority, understand what is happening and why it started so suddenly? No changes were made recently to the pipeline code or to the data volume. I also tried increasing the workers from 6 to 10, and the issue remains the same.

Please note that it was running fine earlier with 6 workers as well.

Provider: Azure Databricks

Any help on a priority basis would be really appreciated, as this is impacting our production data pipelines.

5 REPLIES

Debayan
Databricks Employee

Hi,

Could you please confirm your cluster configuration details? Also, did you verify the network configuration between the control plane and the data plane?

Please tag @Debayan in your next response so that I am notified. Thank you!

vgupta
New Contributor II

Thanks for your response, @Debayan Mukherjee.

Below is the screenshot of the cluster configuration:

[image]

If I understand correctly, we currently do not have any restrictions at the network layer between the control plane and the data plane; these are all defaults.

[image]

Please let me know if you are looking for anything specific in the networking configuration.

Debayan
Databricks Employee

Hi @Vishnu Gupta, thanks for the details.

You can refer to https://kb.databricks.com/en_US/jobs/driver-unavailable, which is probably the issue here.

Please let us know if this helps, and please tag @Debayan in your next response so that I am notified. Thank you!

Anonymous
Not applicable

Hi @Vishnu Gupta

Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.

Please help us select the best solution by clicking on "Select As Best" if it does.

Your feedback will help us ensure that we are providing the best possible service to you.

Thank you!

Reddy-24
New Contributor II

Hello @Debayan, I am facing the same issue while running a Delta Live Tables pipeline. This job is running in production, but it is not working in dev. I have tried increasing the worker nodes, but it did not help. Can you please help with this?

Reddy24_0-1694165445213.png

 
