
Failed to start cluster: Large docker image

NateJ
New Contributor II

I have a large Docker image in our AWS ECR repo. The image is 27.4 GB locally and 11539.79 MB compressed in ECR.

The error from the Event Log is:

Failed to add 2 containers to the compute. Will attempt retry: true. Reason: Docker image pull failure

JSON:

{
  "reason": {
    "code": "DOCKER_IMAGE_PULL_FAILURE",
    "type": "SERVICE_FAULT",
    "parameters": {
      "instance_id": "i-0172cf9b70a25df47",
      "databricks_error_message": "Downloading docker image has timed out"
    }
  },
  "add_node_failure_details": {
    "failure_count": 2,
    "resource_type": "container",
    "will_retry": true
  }
} 
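For anyone triaging this from saved event details, here is a minimal Python sketch that checks whether a payload like the one above is specifically the pull timeout rather than some other pull failure. The helper name `is_image_pull_timeout` is hypothetical, and the field layout is assumed to match the JSON posted here, not any documented schema.

```python
import json


def is_image_pull_timeout(event_details: dict) -> bool:
    """Return True if the event looks like a Docker image pull timeout.

    Assumes the payload shape shown in this post (hypothetical helper).
    """
    reason = event_details.get("reason", {})
    message = reason.get("parameters", {}).get("databricks_error_message", "")
    return (
        reason.get("code") == "DOCKER_IMAGE_PULL_FAILURE"
        and "timed out" in message.lower()
    )


# The event details copied from the post above.
details = json.loads("""
{
  "reason": {
    "code": "DOCKER_IMAGE_PULL_FAILURE",
    "type": "SERVICE_FAULT",
    "parameters": {
      "instance_id": "i-0172cf9b70a25df47",
      "databricks_error_message": "Downloading docker image has timed out"
    }
  },
  "add_node_failure_details": {
    "failure_count": 2,
    "resource_type": "container",
    "will_retry": true
  }
}
""")

print(is_image_pull_timeout(details))
```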

 


Michelangelo
New Contributor III

I'm having the same issue. The official Databricks runtime GPU images are already quite large, so using one as a base means you run into this timeout. Did anyone ever find a fix?

amoghjain
New Contributor II

I have a similar problem. A 10 GB image pulls fine, but a 31 GB image doesn't. Both workers and drivers have 64 GB of memory. I get the timeout error with "Cannot launch the cluster because pulling the docker image failed. Please double check connectivity from workers to the container registry, as well as the credentials used to pull the image"

Were you able to figure out a solution?

Kaniz
Community Manager

Hi @NateJ, one possible solution is to increase the timeout value for the Docker image pull operation. You can also check the Docker image configuration and credentials to make sure they are correct.

Michelangelo
New Contributor III

@Kaniz it's not possible to change the timeout value for the Docker image pull on a Databricks cluster.  That isn't exposed to the user.

The only solution as of now is to reduce the size of your image: try a smaller base image, don't build multiple intermediate images that build off of each other, reduce the number of layers, aggressively purge apt and pip caches, etc.
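To decide where to start trimming, it can help to see which layers carry the most weight. Below is a rough Python sketch that ranks layers from `docker history` output. The function names and the sample data are made up for illustration, and it assumes a local Docker CLI where `docker history --human=false` reports sizes in bytes. Note that these are uncompressed layer sizes, so they will not match the compressed size shown in ECR.

```python
import subprocess


def parse_layers(history_text):
    """Parse lines of '<size in bytes><TAB><created by>' into (size, command) tuples."""
    layers = []
    for line in history_text.splitlines():
        size, sep, created_by = line.partition("\t")
        if not sep or not size.strip().isdigit():
            continue  # skip lines that don't match the expected format
        layers.append((int(size), created_by.strip()))
    return layers


def largest_layers(history_text, top=5):
    """Return the `top` largest layers, biggest first."""
    return sorted(parse_layers(history_text), key=lambda layer: layer[0], reverse=True)[:top]


def docker_history(image):
    """Fetch layer sizes for a local image (requires the Docker CLI)."""
    result = subprocess.run(
        ["docker", "history", "--no-trunc", "--human=false",
         "--format", "{{.Size}}\t{{.CreatedBy}}", image],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Made-up sample output, so the sketch runs without Docker installed.
sample = "\n".join([
    "0\tCMD [\"/bin/bash\"]",
    "5368709120\tRUN pip install torch",
    "1073741824\tRUN apt-get update && apt-get install -y build-essential",
])

for size, command in largest_layers(sample, top=2):
    print(f"{size / 2**30:.2f} GiB  {command}")
```

Against a real image you would pass `docker_history("your-image:tag")` instead of `sample`. Layers created by `pip install` or `apt-get install` steps that leave caches behind are usually the first candidates for cleanup.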
