DLT | Cluster terminated by System-User | INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds
03-12-2023 10:20 AM
Dear Community, I hope you are doing well.
For the last couple of days I have been seeing a very strange issue with my DLT pipeline. In continuous mode it fails every 60-70 minutes with the error:
INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds. at 2023-03-12 21:26:51 IST (please see the screenshot).
When I check the events for the driver, it says "Cluster terminated by system-user" (at 2023-03-12 21:26:47 IST), and I could not find any details associated with this event. This keeps happening: every time the pipeline restarts, it runs fine for about 1 hour, sometimes 1.5 hours, and then fails the same way.
Could anyone please help us, on a priority basis, to understand what is happening and why it started so suddenly? No changes were made recently to the pipeline code or to the data volume. I also tried increasing the workers from 6 to 10, and the issue remains the same.
Please note that earlier it was running fine with 6 workers as well.
Provider: Azure Databricks
Any help on a priority basis would be really appreciated, as this is impacting our production data pipelines.
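One way to look for more detail on the "Cluster terminated by system-user" event is to pull the cluster's event log through the Databricks Clusters API (`POST /api/2.0/clusters/events`) instead of the UI. The following is only a minimal sketch under assumptions: the helper names `build_events_request` and `summarize_events` are hypothetical, the workspace URL and token are placeholders, and the response is assumed to contain an `events` list whose entries carry `timestamp` (epoch milliseconds), `type`, and an optional `details.reason`.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_events_request(host, token, cluster_id, limit=50):
    """Build (but do not send) a request for recent driver/termination events.

    `host` and `token` are placeholders for the workspace URL and a
    personal access token; neither comes from the original post.
    """
    payload = {
        "cluster_id": cluster_id,
        # Event types most relevant to a lost or terminated driver.
        "event_types": ["TERMINATING", "DRIVER_NOT_RESPONDING", "DRIVER_UNAVAILABLE"],
        "order": "DESC",
        "limit": limit,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/clusters/events",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize_events(response):
    """Reduce an events response to (UTC time, event type, reason code) tuples.

    Assumes each event has `timestamp` in epoch milliseconds; `details.reason`
    may be absent, in which case the reason code is None.
    """
    rows = []
    for event in response.get("events", []):
        when = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        reason = event.get("details", {}).get("reason", {})
        rows.append((when.isoformat(), event["type"], reason.get("code")))
    return rows


# Usage (requires network access and a valid token, so it is left commented out):
# req = build_events_request("https://<workspace-url>", "<token>", "0312-140502-k9monrjc")
# with urllib.request.urlopen(req) as resp:
#     for row in summarize_events(json.load(resp)):
#         print(row)
```

If a termination reason code is returned, it may narrow down whether the driver was unresponsive or the cluster was shut down for another reason; the event log alone may still not explain the root cause.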
Labels: Azure, ClusterTermination, DLT, Internal error
03-12-2023 11:22 PM
Hi,
Could you please confirm your cluster configuration details? Also, did you verify the network configuration between the control plane and the data plane?
Please tag @Debayan in your next response so that I am notified. Thank you!
03-14-2023 12:18 AM
Thanks for your response, @Debayan Mukherjee.
Below is the screenshot of the cluster configuration:
If I understand correctly, as of now we do not have any restrictions at the network layer between the control plane and the data plane; everything is at the defaults.
Please let me know if you are looking for anything specific in the networking configuration.
03-15-2023 10:49 PM
Hi @Vishnu Gupta, thanks for the details.
You can refer to https://kb.databricks.com/en_US/jobs/driver-unavailable, which is probably the issue here.
Please let us know if this helps, and tag @Debayan in your next response so that I am notified. Thank you!
03-18-2023 12:31 AM
Hi @Vishnu Gupta
Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.
Please help us select the best solution by clicking on "Select As Best" if it does.
Your feedback will help us ensure that we are providing the best possible service to you.
Thank you!
09-08-2023 02:31 AM
Hello @Debayan, I am facing the same issue while running a Delta Live Tables pipeline. The job runs in production, but it does not work in dev. I have tried increasing the worker nodes, but it did not help. Can you please help with this?