07-11-2025 05:03 AM
Hi Team
Whenever I try to create an endpoint from a model in Databricks, the process often gets stuck at the 'Container Image Creation' step. I've tried to understand what happens during this step, but couldn't find any detailed or helpful information. Can someone explain the full sequence of steps Databricks performs in the background when serving a model endpoint?
Thanks,
Dinesh
07-22-2025 01:39 AM
Hi @Dnirmania
Below is a detailed, sequenced breakdown of what happens in Databricks when you create a model serving endpoint:
1. Model Logging and Registration
2. Endpoint Creation Request
3. Background Infrastructure Orchestration Begins
Internally, a state machine-driven workflow (the control plane) handles endpoint provisioning, with the major next step being Container Image Creation.
4. Container Image Creation: Step-by-Step Technical Workflow
a. Gathering Model Artifacts and Environment Metadata
b. Triggering the Container Build
c. Container Build Steps (Inside the Builder Job)
d. Upload/Push the Image to the Registry
5. Deployment to Serving Infrastructure
6. Endpoint Readiness and Autoscaling
7. Ongoing Lifecycle and Updates
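To make the ordering above easier to follow, here is a minimal Python sketch that models the sequence as an ordered set of stages. The stage names are hypothetical labels derived from the headings in this post; they are not Databricks' internal state names, which are not documented here.

```python
from enum import Enum


class Stage(Enum):
    """Illustrative stages, named after the headings above (not official states)."""
    ENDPOINT_REQUESTED = 1
    ORCHESTRATION_STARTED = 2
    ARTIFACTS_GATHERED = 3
    BUILD_TRIGGERED = 4
    IMAGE_BUILDING = 5
    IMAGE_PUSHED = 6
    DEPLOYING = 7
    READY = 8


def next_stage(stage):
    """Return the stage that follows `stage`, or None once READY is reached."""
    ordered = list(Stage)  # Enum members iterate in definition order
    idx = ordered.index(stage)
    return ordered[idx + 1] if idx + 1 < len(ordered) else None


def remaining_stages(stage):
    """List the names of all stages still ahead of `stage`."""
    ordered = list(Stage)
    return [s.name for s in ordered[ordered.index(stage) + 1:]]
```

Read this way, an endpoint that appears stuck at 'Container Image Creation' would sit somewhere between ARTIFACTS_GATHERED and IMAGE_PUSHED in this sketch, before any serving infrastructure is deployed.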
What can cause the 'Container Image Creation' step to get stuck or run slowly?
• For GPU serving, the build times out if it takes more than 60 minutes (a retry is sometimes needed).
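Since a build can time out and need a retry, it can help to poll the endpoint with your own deadline instead of waiting indefinitely. Below is a minimal sketch, not an official client. `fetch_state` is a hypothetical callable you supply (for example, a wrapper around whatever call you use to read the endpoint's status). The dict shape with `ready` and `config_update` keys is an assumption modeled on the endpoint state reported by the serving endpoints API, and the 3600-second default simply mirrors the 60-minute figure above.

```python
import time


def wait_until_ready(fetch_state, timeout_s=3600, poll_s=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the endpoint is ready, its update fails, or the deadline passes.

    fetch_state: zero-argument callable returning a dict such as
        {"ready": "NOT_READY", "config_update": "IN_PROGRESS"}
        (assumed shape; adapt the keys to what your client returns).
    Returns one of "ready", "failed", or "timeout".
    """
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state.get("ready") == "READY":
            return "ready"
        if state.get("config_update") == "UPDATE_FAILED":
            return "failed"
        if clock() >= deadline:
            return "timeout"
        sleep(poll_s)  # wait before asking again
```

If this returns "timeout" or "failed", that is the point to check the build logs and decide whether to retry the endpoint update. The `sleep` and `clock` parameters exist only so the function can be exercised without real waiting.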
07-28-2025 06:45 AM
Thank you @Vidhi_Khaitan for sharing the detailed process.