DLT | Cluster terminated by System-User | INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds

vgupta
New Contributor II

Dear Community, I hope you are doing well.

For the last couple of days I have been seeing a very strange issue with my DLT pipeline. In continuous mode it fails every 60-70 minutes with the following error:

INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds. at 2023-03-12 21:26:51 IST (please see the screenshot).

[Screenshot: DLT_ERROR]

When I check the events for the driver, they say "Cluster terminated by system-user" (at 2023-03-12 21:26:47 IST), and I could not find any details associated with this event. This keeps happening: every time the pipeline restarts, it runs fine for about 1 hour, sometimes 1.5 hours, and then fails in the same way.
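For anyone trying to get more detail than the UI event list shows, here is a minimal sketch that pulls termination events from the Clusters API events endpoint (`POST /api/2.0/clusters/events`) and prints the termination reason fields. The workspace URL and token are placeholders, the function names are made up for illustration, and the fields that actually come back may vary by workspace and API version, so treat this as a starting point rather than a definitive diagnostic.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_cluster_events(host, token, cluster_id, limit=50):
    """Call POST /api/2.0/clusters/events and return the parsed JSON body.

    host and token are placeholders for your workspace URL and a
    personal access token; nothing here is specific to DLT.
    """
    body = json.dumps({
        "cluster_id": cluster_id,
        "order": "DESC",                  # newest events first
        "event_types": ["TERMINATING"],   # only termination events
        "limit": limit,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/clusters/events",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_terminations(response):
    """Return one dict per TERMINATING event with its time and reason fields."""
    rows = []
    for event in response.get("events", []):
        if event.get("type") != "TERMINATING":
            continue
        details = event.get("details", {})
        reason = details.get("reason", {})
        # Event timestamps are in milliseconds since the epoch.
        ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        rows.append({
            "time_utc": ts.isoformat(),
            "user": details.get("user"),
            "code": reason.get("code"),
            "type": reason.get("type"),
            "parameters": reason.get("parameters", {}),
        })
    return rows


if __name__ == "__main__":
    # Placeholder values: replace with your own workspace URL and token.
    events = fetch_cluster_events(
        "https://<your-workspace>.azuredatabricks.net",
        "<personal-access-token>",
        "0312-140502-k9monrjc",
    )
    for row in summarize_terminations(events):
        print(row)
```

If the event carries a termination reason, the `code` and `parameters` fields are usually more informative than the "terminated by system-user" text shown in the UI.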

[Screenshot: DLT_Cluster_events]

Could anyone please help us, on priority, understand what is happening and why it started so suddenly? No changes were made recently to the pipeline code or to the data volume. I also tried increasing the workers from 6 to 10, and the issue remains the same.

Please note that it was previously running fine with 6 workers.

Provider: Azure Databricks

Any help on a priority basis would be really appreciated, as this is impacting our production data pipelines.

5 REPLIES

Debayan
Esteemed Contributor III

Hi,

Could you please confirm your cluster configuration details? Also, did you verify the network configuration between the control plane and the data plane?

Please tag @Debayan in your next response so that I am notified. Thank you!

vgupta
New Contributor II

Thanks for your response, @Debayan Mukherjee.

Below is the screenshot of the cluster configuration:

[Screenshot: cluster configuration]

If I understand correctly, we currently do not have any restrictions at the network layer between the control plane and the data plane; everything is at the defaults.

[Screenshot]

Please let me know if you are looking for anything specific in the networking configuration.

Debayan
Esteemed Contributor III

Hi @Vishnu Gupta, thanks for the details.

You can refer to https://kb.databricks.com/en_US/jobs/driver-unavailable, which is probably the issue here.

Please let us know if this helps, and tag @Debayan in your next response so that I am notified. Thank you!

Anonymous
Not applicable

Hi @Vishnu Gupta​ 

Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.

Please help us select the best solution by clicking on "Select As Best" if it does.

Your feedback will help us ensure that we are providing the best possible service to you.

Thank you!

Reddy-24
New Contributor II

Hello @Debayan, I am facing the same issue while running a Delta Live Tables pipeline. The job runs in production, but it is not working in dev. I have tried increasing the worker nodes, but it did not help. Can you please help with this?

[Screenshot: Reddy24_0-1694165445213.png]

 
