
"Standard_NC8as_T4_v3" and "Standard_NC4as_T4_v3" instances

amoghjain
New Contributor II

I am running into an issue where "Standard_NC8as_T4_v3" and "Standard_NC4as_T4_v3" instances behave differently with a 30 GB custom Docker image, and I am a bit stumped.

When using NC4 instances, I get a timeout, with the exact message shown below:

Message

Compute terminated. Reason: Docker image pull failure

Help

Cannot launch the cluster because pulling the docker image failed. Please double check connectivity from workers to the container registry, as well as the credentials used to pull the image.

Instance ID: ------hidden for privacy-----

Internal error message: Container setup failed due to a docker image pull failure: Exception when downloading docker container image: Timed out with exception after 23927 attempts


Now, I am a bit surprised and am trying to find out why NC4 instances hit this timeout whereas NC8 instances do not. Any help or pointers would be appreciated!
 
Have a nice day!!
 
Thanks
 

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @amoghjain, it seems you’re encountering a timeout when pulling a custom Docker image on your NC4 instances, while the NC8 instances work fine.

 

Let’s explore some potential reasons and pointers:

 

Network Configuration:

  • Check the connectivity from your NC4 instances to the container registry. Ensure that no network restrictions or firewalls are blocking the communication.
  • Verify that the credentials used for pulling the image are correct. Sometimes, incorrect credentials can lead to timeouts.
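
As a rough way to test the first point, the sketch below tries to open a TCP connection to the registry and reports how long it took. It only checks reachability on the given port, not credentials. The host name in the comment is a hypothetical placeholder, and running it from a notebook on an NC4 cluster started without the custom image is an assumption about how you would reproduce the worker's network path.

```python
import socket
import time


def check_tcp(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Try to open a TCP connection and report success plus elapsed time."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and connects; it raises OSError
        # (including DNS failures and timeouts) when the connection fails.
        with socket.create_connection((host, port), timeout=timeout):
            ok, error = True, None
    except OSError as exc:
        ok, error = False, str(exc)
    return {
        "host": host,
        "port": port,
        "ok": ok,
        "error": error,
        "seconds": round(time.monotonic() - start, 3),
    }


# Example call (hypothetical registry host, replace with your own):
# print(check_tcp("myregistry.example.com", 443))
```

If the check succeeds on both instance types, plain connectivity is unlikely to explain the difference, and the remaining points are worth a closer look.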

DNS Resolution:

  • DNS resolution can cause timeouts during image pulls. Make sure your instances can resolve domain names properly.
  • Check if DNS queries are being run from ephemeral ports. You can add your local DNS cache/server as the primary resolver in /etc/resolv.conf, one entry per line: "nameserver 192.168.0.1" followed by "nameserver 8.8.8.8".
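
To check these DNS points from a node, something like the following could help. It is a minimal sketch: `resolve` asks the system resolver for a host's addresses, and `parse_nameservers` lists the resolvers configured in a resolv.conf-style text. Both function names are made up for this example, and the registry host in the comment is a placeholder.

```python
import socket


def resolve(hostname: str) -> list:
    """Return sorted unique IP addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def parse_nameservers(resolv_conf_text: str) -> list:
    """Extract the addresses on 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


# Example calls (hypothetical registry host):
# print(resolve("myregistry.example.com"))
# with open("/etc/resolv.conf") as fh:
#     print(parse_nameservers(fh.read()))
```

An empty list from `resolve` on the NC4 node but not on the NC8 node would point at DNS rather than at the image itself.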

Resource Constraints:

  • NC4 and NC8 instances have different resource profiles. Ensure that the NC4 instances have sufficient resources (CPU, memory, etc.) to handle the image pull process.
  • Consider monitoring resource utilization during the image pull to identify any bottlenecks.
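
A quick way to compare the two instance types is to record CPU count and free disk space on each and check whether the image plausibly fits. The sketch below uses only the standard library. The 2x headroom factor (room for compressed layers plus the extracted filesystem) is an assumption for illustration, not a documented Docker or Databricks requirement.

```python
import os
import shutil


def resource_snapshot(path: str = "/") -> dict:
    """Report CPU count and disk capacity (decimal GB) for the volume at path."""
    usage = shutil.disk_usage(path)
    return {
        "cpus": os.cpu_count(),
        "disk_total_gb": round(usage.total / 1e9, 1),
        "disk_free_gb": round(usage.free / 1e9, 1),
    }


def has_room_for_image(free_gb: float, image_gb: float, headroom: float = 2.0) -> bool:
    """True if free space covers the image size times an assumed headroom factor."""
    return free_gb >= image_gb * headroom


# Example: compare this node's free space against the 30 GB image.
# snap = resource_snapshot("/")
# print(snap, has_room_for_image(snap["disk_free_gb"], 30))
```

Running this on one node of each type gives concrete numbers to compare rather than relying on the published instance specifications alone.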

Image Size and Complexity:

  • The custom Docker image size (30 GB) might impact the pull process. Larger images take longer to download.
  • Check if the image has any complex layers or dependencies that could cause timeouts.
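
To get a feel for how download time scales with image size, here is a back-of-the-envelope estimate. The throughput figures are hypothetical placeholders, not measured or published numbers for either instance type, and the estimate ignores layer extraction time, which can also be significant for large images.

```python
def pull_time_seconds(image_gb: float, throughput_mbps: float) -> float:
    """Estimate download time for image_gb (decimal GB) at throughput_mbps (megabits/s)."""
    bits = image_gb * 8 * 1000**3
    return bits / (throughput_mbps * 1e6)


# Hypothetical effective throughputs between a worker and the registry.
for mbps in (100, 500, 1000):
    minutes = pull_time_seconds(30, mbps) / 60
    print(f"30 GB at {mbps} Mbps: about {minutes:.0f} min")
```

If the pull is subject to a fixed time budget, the effective throughput each instance type actually achieves against your registry is the number worth measuring.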

Retry Mechanism:

  • Docker retries image pulls by default. However, you can adjust the retry behaviour by modifying the Docker configuration.
  • Consider increasing the timeout or adjusting the retry settings to see if it resolves the issue.

Instance-Specific Factors:

  • Investigate if there are any specific differences between NC4 and NC8 instances in terms of networking, security groups, or other configurations.
  • Review logs and error messages to pinpoint the exact cause of the timeout.

 

Have a great day, and I hope this helps! 😊
