Frequent timeouts and bootstrap errors when launching Databricks Delta Live Tables (DLT) pipeline clusters on AWS are usually caused by network connectivity issues, VPC misconfigurations, or resource allocation problems between the Databricks control plane and the data plane in your AWS account. An occasional failure that clears after a restart can be treated as transient; persistent failures need targeted troubleshooting.
Common Causes
- Network Problems: Connectivity between your AWS VPC/subnet and Databricks services may be blocked or throttled, often by firewall rules or peering misconfigurations.
- VPC Configuration Issues: The Databricks service IPs and domains (control plane, object storage, metastore, etc.) must be allowlisted in your VPC.
- Resource Constraints: The cluster may be undersized for your workload, or instances may fail to allocate because of quota limits, spot interruptions, or network bottlenecks.
- Recent Network Changes: Recent modifications to VPC peering, routing tables, or DNS can interrupt the bootstrap process, sometimes even after the changes are reverted.
Immediate Steps to Fix
- Restart the Pipeline: This often clears up transient cloud infrastructure errors if it's not a systemic issue.
- Check Databricks Service Status: Verify there are no ongoing AWS-related outages for your region via the Databricks service status page.
- Audit VPC Firewall Rules & Allowlisting: Ensure all required Databricks FQDNs and IPs are allowlisted for both control plane and storage access. Refer to the official Databricks documentation for a full list.
- Review Cluster Logs: Download the EC2 instance system log from the AWS console, look for bootstrap errors, and decode any FAILED_MESSAGE for further details.
- Scale Up Resources: Consider using larger or dedicated (non-spot) cluster instances if your workload has increased or is hitting resource limits.
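The allowlisting audit above can be partly automated with a plain TCP reachability check run from a host inside the affected subnet. The following is a minimal sketch using only the Python standard library; the function names `check_tcp` and `audit_endpoints` are illustrative, and the hostname in the commented usage is a placeholder, not a real Databricks endpoint. Take the actual endpoint list for your region from the Databricks documentation.

```python
import socket


def check_tcp(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the socket;
        # DNS failures, refusals, and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def audit_endpoints(endpoints, timeout=5.0):
    """Map each (host, port) pair to whether it is reachable over TCP."""
    return {
        (host, port): check_tcp(host, port, timeout)
        for host, port in endpoints
    }


# Example usage (placeholder hostname; substitute your own endpoints):
# results = audit_endpoints([("<your-workspace-host>", 443)])
# for (host, port), ok in results.items():
#     print(f"{host}:{port} -> {'reachable' if ok else 'BLOCKED or unresolved'}")
```

A `False` result only shows that the TCP handshake did not complete; it does not say whether a security group, network ACL, route table, or DNS is responsible, so treat it as a pointer to where to look next.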
Advanced Troubleshooting
- Create Minimal Cluster: Try launching a default Databricks cluster with minimal configuration to isolate whether the issue is related to specific networking or security settings.
- Check DNS and Routing: If you are using VPC peering or custom DNS, confirm that Databricks management and data VPCs can communicate, especially if peering changes were made recently.
- Contact Support: If timeouts persist after verifying setup, collect the cluster log messages and error details to share with Databricks support. They may identify subtle infrastructure or backend issues.
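For the DNS check above, one quick test is whether the required hostnames resolve at all from inside the VPC, and to which addresses. The sketch below assumes nothing beyond Python's standard `socket` module; `resolve_host` is an illustrative helper, and its `resolver` parameter exists only so the lookup can be swapped out when testing.

```python
import socket


def resolve_host(host, port=443, resolver=socket.getaddrinfo):
    """Return the sorted unique IPs a hostname resolves to, or [] on failure."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed (unknown host, unreachable DNS server, etc.).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


# Example usage (placeholder hostname; substitute your own endpoints):
# ips = resolve_host("<your-workspace-host>")
# print(ips or "resolution failed: check custom DNS and VPC DNS settings")
```

An empty result points at DNS configuration, while a hostname that resolves but cannot be reached points at routing, peering, or firewall rules instead.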
Summary Table
| Common Cause | Typical Fix |
| --- | --- |
| Network/Firewall | Update allowlists and firewall rules |
| VPC Misconfiguration | Correct subnets, routing, DNS |
| Resource Limits | Scale up instances, avoid spot for core jobs |
| Transient Cloud Fault | Restart pipeline/job |
If the pipeline still fails after these steps, provide error logs and configuration details to Databricks support for direct assistance.