03-12-2023 10:20 AM
Dear Community, I hope you are doing well.
For the last couple of days I have been seeing a very strange issue with my DLT pipeline. In continuous mode it fails every 60-70 minutes with the error:
INTERNAL_ERROR: Communication lost with driver. Cluster 0312-140502-k9monrjc was not reachable for 120 seconds. at 2023-03-12 21:26:51 IST (please see the screenshot).
 
When I check the events for the driver, it says "Cluster terminated by system-user" (at 2023-03-12 21:26:47 IST), and I could not find any further details associated with this event. This keeps happening: every time the pipeline restarts, it runs fine for about 1 to 1.5 hours and then fails in the same way.
Could anyone please help us, as a priority, understand what is happening and why it started so suddenly? No changes were made recently to either the pipeline code or the data volume. I also tried increasing the workers from 6 to 10, and the issue remains the same.
Please note that earlier it was running fine with 6 workers as well.
Provider: Azure Databricks
Any help on a priority basis would be really appreciated, as this is impacting our production data pipelines.
03-12-2023 11:22 PM
Hi,
Could you please confirm your cluster configuration details? Also, did you verify the network configuration between the control plane and the data plane?
Please tag @Debayan in your next response so that I am notified. Thank you!
03-14-2023 12:18 AM
Thanks for your response, @Debayan Mukherjee.
Below is the screenshot for cluster configurations:
If I understand correctly, we do not currently have any restrictions at the network layer between the control plane and the data plane; everything is at its default.
Please let me know if you are looking for anything specific in the networking configuration.
03-15-2023 10:49 PM
Hi @Vishnu Gupta, thanks for the details.
You can refer to https://kb.databricks.com/en_US/jobs/driver-unavailable, which is probably the issue here.
Please let us know if this helps, and tag @Debayan in your next response so that I am notified. Thank you!
03-18-2023 12:31 AM
Hi @Vishnu Gupta,
Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.
Please help us select the best solution by clicking on "Select As Best" if it does.
Your feedback will help us ensure that we are providing the best possible service to you.
Thank you!
09-08-2023 02:31 AM
Hello @Debayan, I am facing the same issue while running a Delta Live Tables pipeline. The job runs in production, but it is not working in dev. I have tried increasing the worker nodes, but to no avail. Can you please help with this?
05-06-2025 01:53 AM
We had a similar error on one of our DLT pipelines. It can sometimes be caused by compute size: we had increased the compute size for our DLT pipeline, but we were still seeing this error while processing a very large file.
We then added the parameters below to the DLT pipeline configuration, raising the timeout from its default of 120s to 3600s, and the pipeline then ran successfully:
pipeline.timeout=3600s
pipeline.clusterShutdown.delay=120s
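For anyone who edits the pipeline settings as JSON rather than through the UI, here is a minimal sketch of how the two parameters above might be merged into the `configuration` map of the settings. The key names are copied verbatim from this post and I have not verified them against the documentation (Databricks documents the shutdown-delay setting as `pipelines.clusterShutdown.delay`, with a plural prefix, so please check the exact spelling for your workspace). The helper `with_timeout_settings` and the pipeline name are hypothetical, used only for illustration.

```python
import json


def with_timeout_settings(settings, timeout="3600s", shutdown_delay="120s"):
    """Return a copy of the pipeline settings with the two keys from the post added.

    The key names are taken as-is from the post above and are an assumption;
    verify them against the Databricks documentation before relying on them.
    """
    updated = dict(settings)
    # Copy the existing configuration map so the caller's dict is not mutated.
    configuration = dict(updated.get("configuration", {}))
    configuration["pipeline.timeout"] = timeout
    configuration["pipeline.clusterShutdown.delay"] = shutdown_delay
    updated["configuration"] = configuration
    return updated


# Hypothetical settings for a continuous pipeline.
settings = {"name": "example_pipeline", "continuous": True, "configuration": {}}
print(json.dumps(with_timeout_settings(settings), indent=2))
```

The printed JSON can be compared with the pipeline's settings view; any existing configuration entries are kept and only the two keys are added or overwritten.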