
DLT constantly failing with timeout errors

dataminion01
New Contributor II

DLT was working, but then it started getting timeouts frequently:

com.databricks.pipelines.common.errors.deployment.DeploymentException: Failed to launch pipeline cluster xxxxxxxxxxxx: Self-bootstrap timed out during launch. Please try again later and contact Databricks if the problem persists. databricks_error_message:
Instance bootstrap failed command: Bootstrap_e2e
Instance b...
This error could be transient - restart your pipeline and report if you still see the same issue.

Databricks on AWS

Does anyone know how to fix this?

1 REPLY

mark_ott
Databricks Employee

Frequent timeouts and bootstrap errors when launching Delta Live Tables (DLT) pipeline clusters on AWS are usually caused by network connectivity issues, VPC misconfiguration, or resource allocation problems between the Databricks control plane and the data plane in your cloud account. The error is transient if it occurs only occasionally and may go away after a restart, but persistent failures need targeted troubleshooting.

Common Causes

  • Network Problems: Connectivity between your AWS VPC/subnet and Databricks services may be blocked or throttled, often by firewall rules or peering misconfigurations (a quick connectivity probe is sketched after this list).

  • VPC Configuration Issues: The required Databricks service IPs and domains (control plane, root/artifact storage, metastore, etc.) must be allowlisted in your VPC.

  • Resource Constraints: The cluster may be undersized for your workload, or instances may not be allocating correctly due to quotas, spot interruptions, or network bottlenecks.

  • Recent Network Changes: If you recently modified VPC peering, routing tables, or DNS, this could interrupt the bootstrap process even after reverting the changes.
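
If you suspect one of the network-related causes above, a quick probe from a cluster that does launch (or from an EC2 instance in the same subnet) can confirm whether the required endpoints are reachable. This is a minimal sketch only; the hostnames below are placeholders, so substitute the control-plane, storage, and STS endpoints listed for your region in the Databricks networking documentation.

```python
import socket

# Placeholder endpoints -- replace with the actual hosts for your region/workspace.
endpoints = [
    ("dbc-example.cloud.databricks.com", 443),  # workspace / control plane (placeholder)
    ("s3.us-east-1.amazonaws.com", 443),        # S3 root/artifact storage (example region)
    ("sts.us-east-1.amazonaws.com", 443),       # STS, used for instance credentials (example region)
]

for host, port in endpoints:
    try:
        # Attempt a plain TCP connection; a timeout or refusal points at firewall/routing issues.
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as exc:
        print(f"FAIL  {host}:{port} -> {exc}")
```

If any endpoint fails here, fix the firewall/routing path before retrying the pipeline.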

Immediate Steps to Fix

  • Restart the Pipeline: This often clears transient cloud infrastructure errors if the issue is not systemic (see the REST API sketch after this list).

  • Check Databricks Service Status: Verify there are no ongoing AWS-related outages for your region via the Databricks service status page.

  • Audit VPC Firewall Rules & Allowlisting: Ensure all required Databricks FQDNs and IPs are allowlisted for both control-plane and storage access. Refer to the official Databricks documentation for the full list.

  • Review Cluster Logs: Download the EC2 instance system log from the AWS console, look for bootstrap errors, and decode any FAILED_MESSAGE for further details.

  • Scale Up Resources: Consider larger or dedicated (non-spot) cluster instances if your workload has increased or is hitting resource limits.
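
As a concrete example of the restart and log-review steps, the sketch below triggers a pipeline update and then pulls recent pipeline events through the REST API, where launch/bootstrap failures usually surface with more detail than the UI banner. It assumes the Pipelines API endpoints shown and uses placeholder environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, PIPELINE_ID); check the API reference for your workspace before relying on it.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]        # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]
pipeline_id = os.environ["PIPELINE_ID"]
headers = {"Authorization": f"Bearer {token}"}

# Trigger a new update (equivalent to pressing Start in the UI).
resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers=headers,
    json={"full_refresh": False},
)
resp.raise_for_status()
print("Started update:", resp.json().get("update_id"))

# Pull recent pipeline events to inspect launch/bootstrap error messages.
events = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}/events",
    headers=headers,
    params={"max_results": 25},
)
events.raise_for_status()
for e in events.json().get("events", []):
    print(e.get("timestamp"), e.get("event_type"), e.get("message"))
```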

Advanced Troubleshooting

  • Create Minimal Cluster: Launch a default Databricks cluster with minimal configuration to isolate whether the issue is tied to specific networking or security settings (a sketch follows this list).

  • Check DNS and Routing: If you use VPC peering or custom DNS, confirm that the Databricks management and data VPCs can communicate, especially if peering changes were made recently.

  • Contact Support: If timeouts persist after verifying the setup, collect the cluster log messages and error details to share with Databricks support; they may identify subtle infrastructure or backend issues.
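
For the minimal-cluster isolation test, something like the following sketch creates a small on-demand cluster through the Clusters API with no init scripts, policies, or spot instances. The runtime version and node type are placeholders; pick values valid for your workspace. If this bare-bones cluster also fails to bootstrap, the problem is almost certainly in the VPC/networking layer rather than in the DLT pipeline configuration.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}

payload = {
    "cluster_name": "bootstrap-isolation-test",
    "spark_version": "14.3.x-scala2.12",  # placeholder runtime; use one available in your workspace
    "node_type_id": "m5.large",           # small on-demand (non-spot) instance type
    "num_workers": 1,
    "autotermination_minutes": 30,        # clean up automatically after the test
}

resp = requests.post(f"{host}/api/2.1/clusters/create", headers=headers, json=payload)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```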

Summary Table

Common Cause             Typical Fix
Network/Firewall         Update allowlists & firewall rules
VPC Misconfiguration     Correct subnets, routing, DNS
Resource Limits          Scale up instances; avoid spot for core jobs
Transient Cloud Fault    Restart the pipeline/job

If the pipeline still fails after these steps, provide the error logs and configuration details to Databricks support for direct assistance.