DLT | Cluster terminated by System-User | INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds

vgupta
New Contributor II

Dear Community, I hope you are doing well.

For the last couple of days I have been seeing a very strange issue with my DLT pipeline. Running in continuous mode, it fails every 60-70 minutes with the error:

INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds.

This occurred at 2023-03-12 21:26:51 IST (please see the screenshot).

[Screenshot: DLT_ERROR]

When I check the events for the driver, it says "Cluster terminated by system-user" (at 2023-03-12 21:26:47 IST), and I could not find any details associated with this event. This keeps happening: every time the pipeline restarts, it runs fine for about 1 hour, sometimes 1.5 hours, and then fails in the same way.
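For anyone who wants to inspect the same time window, below is a minimal sketch of how the events around the termination could be requested from the Databricks Clusters API (`POST /api/2.0/clusters/events`). The helper name `build_events_query` and the 5-minute window are assumptions made for illustration; the snippet only builds the request body and does not call the workspace.

```python
from datetime import datetime, timedelta, timezone

# IST is UTC+05:30; the timestamps in this post are in IST.
IST = timezone(timedelta(hours=5, minutes=30))


def build_events_query(cluster_id, around, window_minutes=5):
    """Build a request body for POST /api/2.0/clusters/events.

    Hypothetical helper: it selects events within +/- window_minutes
    of the given moment. The API expects epoch milliseconds.
    """
    start = around - timedelta(minutes=window_minutes)
    end = around + timedelta(minutes=window_minutes)
    return {
        "cluster_id": cluster_id,
        "start_time": int(start.timestamp() * 1000),
        "end_time": int(end.timestamp() * 1000),
        "order": "ASC",
        "limit": 50,
    }


# Termination time reported in the cluster event log.
terminated_at = datetime(2023, 3, 12, 21, 26, 47, tzinfo=IST)
payload = build_events_query("0312-140502-k9monrjc", terminated_at)
print(payload)
```

The resulting dictionary would be sent as the JSON body of an authenticated request to the workspace URL. The events returned for that window may show more detail than the UI summary, though that is not guaranteed.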

[Screenshot: DLT_Cluster_events]

Could anyone please help us, as a priority, to understand what is happening and why it started so suddenly? No changes were made recently to the pipeline code or to the data volume. I also tried increasing the workers from 6 to 10, but the issue remains the same.

Please note that earlier it was running fine with 6 workers as well.

Provider: Azure Databricks

Any urgent help will be really appreciated, as this is impacting our production data pipelines.