Hi there,
Great breakdown of the symptoms — these are actually two distinct issues likely sharing a common root cause in your VPC/network configuration. Let me address both:
Issue 1: Serverless Compute — .com.br DNS Resolution Failure
Root Cause
Serverless compute in Databricks does NOT run inside your custom VPC. It runs in a Databricks-managed network and egresses through Databricks' own infrastructure. This means:
- Your VPC's outbound Security Group rules (0.0.0.0/0 on port 443) do not apply to Serverless
- Serverless traffic goes through Databricks-controlled egress, which may have its own DNS resolvers and egress filtering
- .com.br TLD resolution can fail if the managed DNS used by Serverless doesn't properly resolve country-code TLDs (ccTLDs) or if those domains are not on Databricks' egress allowlist
Fix for Serverless Connectivity
Option 1 — Use Serverless Network Policies (Recommended)
Databricks introduced Serverless Network Policies to control egress from Serverless compute. You need to explicitly allow the .com.br destinations:
- Go to Account Console → Network → Serverless Network Policies
- Add an egress policy that explicitly allows the target .com.br domains/IPs
- This is the correct and supported way to control Serverless egress — Security Groups alone won't work
Option 2 — Contact Databricks Support
If the .com.br domains are being blocked at the Databricks-managed egress layer (not your VPC), you'll need Support to confirm whether those ccTLDs are filtered and to whitelist them at the platform level for your workspace in sa-east-1.
Option 3 — Verify DNS explicitly
In a Serverless notebook, run:
```python
import socket

try:
    # getaddrinfo only exercises DNS; it does not open a connection
    print(socket.getaddrinfo("yourtarget.com.br", 443))
except Exception as e:
    print(f"DNS failed: {e}")
```
If this raises, DNS resolution itself is failing; if it returns addresses, the block is at the TCP/TLS connection layer. That distinction is important for Support.
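Because the DNS lookup alone cannot show where a connection is dropped, a slightly longer sketch can walk the three stages in order (DNS, TCP connect, TLS handshake) and report the first one that fails. The function name `check_endpoint` and the 5-second timeout are illustrative choices, not anything Databricks prescribes:

```python
import socket
import ssl


def check_endpoint(host, port=443, timeout=5.0):
    """Return (stage, detail), where stage is 'dns', 'tcp', 'tls', or 'ok'."""
    # Stage 1: DNS. gaierror means the name could not be resolved.
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as e:
        return ("dns", str(e))

    # Stage 2: TCP. A timeout or refusal here points to an egress block.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as e:
        return ("tcp", str(e))

    # Stage 3: TLS handshake with certificate verification.
    try:
        ctx = ssl.create_default_context()
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return ("ok", tls.version())
    except (ssl.SSLError, OSError) as e:
        return ("tls", str(e))
    finally:
        sock.close()
```

In a Serverless notebook you could call `check_endpoint("yourtarget.com.br")` and paste the returned stage and message into the Support ticket.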
Issue 2: Classic Cluster — Spark Hanging Indefinitely
Root Cause
Classic clusters do run inside your VPC, so this is almost certainly a VPC networking/configuration problem. A Spark job hanging without starting (not failing — just hanging) typically points to:
| Cause | Explanation |
| --- | --- |
| Driver ↔ Executor communication blocked | Security Groups may block internal cluster traffic on required ports |
| S3 / Metastore connectivity issue | Unity Catalog metastore or S3 access is blocked, causing Spark context init to stall |
| Missing VPC Endpoints | Required AWS endpoints (S3, STS, KMS) may be missing, causing timeouts |
| DNS resolution failure inside VPC | Custom VPC may have DNS hostnames/resolution not enabled |
Fix for Classic Cluster Spark Hang
Step 1 — Check VPC DNS Settings (Most Common Fix)
In AWS Console → Your VPC → Actions:
- Enable DNS hostnames → must be Yes
- Enable DNS resolution → must be Yes
If either is disabled, Spark nodes can't resolve each other or AWS service endpoints — causing silent hangs.
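If you prefer to check this from code rather than the console, the following is a minimal sketch that reads both attributes through boto3's `describe_vpc_attribute` call. The helper name `vpc_dns_status` and the VPC ID in the usage comment are placeholders; the function takes the EC2 client as a parameter so it can be reused with whatever credentials you already have:

```python
def vpc_dns_status(ec2_client, vpc_id):
    """Return both DNS attributes of a VPC as booleans."""
    # Each call returns only the attribute that was asked for.
    support = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsSupport"
    )
    hostnames = ec2_client.describe_vpc_attribute(
        VpcId=vpc_id, Attribute="enableDnsHostnames"
    )
    return {
        "enableDnsSupport": support["EnableDnsSupport"]["Value"],
        "enableDnsHostnames": hostnames["EnableDnsHostnames"]["Value"],
    }


# Usage (needs boto3 and AWS credentials; the VPC ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="sa-east-1")
#   print(vpc_dns_status(ec2, "vpc-0123456789abcdef0"))
```

Both values should come back as `True` for the workspace VPC.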
Step 2 — Verify Security Group Inbound Rules for Internal Traffic
Databricks Classic Clusters require self-referencing inbound rules in the Security Group:
| Type | Protocol | Port Range | Source |
| --- | --- | --- | --- |
| All TCP | TCP | 0–65535 | Same Security Group ID |
| All UDP | UDP | 0–65535 | Same Security Group ID |
Without this, Driver and Executor nodes can't communicate — Spark will silently hang.
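One way to audit this is to inspect the security group as returned by boto3's `describe_security_groups`. The sketch below is a simplified check under that assumption: it looks for a single inbound rule covering ports 0 to 65535 (or an all-protocols rule) whose source is the group itself, and it will miss a setup that splits the range across several rules. The helper name is illustrative:

```python
def allows_all_ports_from_self(security_group, protocol):
    """True if one inbound rule opens every port of `protocol` to the same group."""
    group_id = security_group["GroupId"]
    for perm in security_group.get("IpPermissions", []):
        proto = perm.get("IpProtocol")
        # "-1" means all protocols and all ports; no port range is present.
        all_ports = proto == "-1" or (
            proto == protocol
            and perm.get("FromPort") == 0
            and perm.get("ToPort") == 65535
        )
        from_self = any(
            pair.get("GroupId") == group_id
            for pair in perm.get("UserIdGroupPairs", [])
        )
        if all_ports and from_self:
            return True
    return False


# Usage (needs boto3 and AWS credentials; the group ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="sa-east-1")
#   sg = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])["SecurityGroups"][0]
#   print(allows_all_ports_from_self(sg, "tcp"), allows_all_ports_from_self(sg, "udp"))
```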
Step 3 — Verify Required VPC Endpoints Exist
For Unity Catalog + AWS in a custom VPC, these endpoints are strongly recommended:
- com.amazonaws.sa-east-1.s3 (Gateway type)
- com.amazonaws.sa-east-1.sts
- com.amazonaws.sa-east-1.kinesis-streams (if using streaming)
If the private subnets have no other route to these services (for example, no NAT gateway), missing S3 or STS endpoints will cause Spark to stall during initialization.
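To compare what exists against that list, a small helper can diff the required service names against the endpoints boto3 reports for the VPC. This is a sketch: `missing_endpoints` is a made-up name, and it assumes the list comes from `describe_vpc_endpoints` and that only endpoints in the available state should count:

```python
def missing_endpoints(vpc_endpoints, required_services):
    """Return required service names that have no available endpoint."""
    present = {
        ep["ServiceName"]
        for ep in vpc_endpoints
        # Compare case-insensitively; ignore pending or deleted endpoints.
        if ep.get("State", "").lower() == "available"
    }
    return sorted(set(required_services) - present)


# Usage (needs boto3 and AWS credentials; the VPC ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="sa-east-1")
#   eps = ec2.describe_vpc_endpoints(
#       Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
#   )["VpcEndpoints"]
#   print(missing_endpoints(eps, [
#       "com.amazonaws.sa-east-1.s3",
#       "com.amazonaws.sa-east-1.sts",
#   ]))
```

An empty list means every required service has an endpoint in place.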
Step 4 — Check Cluster Event Logs
In the Databricks UI → Cluster → Event Log tab, look for timeout or unreachable host errors that may not surface in the notebook itself.
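If the Event Log is long, you can export it and filter it in code. The sketch below assumes the events are available as a list of dictionaries with `type` and `details` keys, roughly the shape returned by the Clusters API events endpoint; the list of suspect event types and the keywords are my own selection, so adjust them to what your log actually contains:

```python
import json

# Assumed event types worth a closer look for a hang; extend as needed.
SUSPECT_TYPES = {
    "DRIVER_NOT_RESPONDING",
    "DRIVER_UNAVAILABLE",
    "METASTORE_DOWN",
    "DBFS_DOWN",
    "NODES_LOST",
    "SPARK_EXCEPTION",
}
KEYWORDS = ("timeout", "timed out", "unreachable")


def suspicious_events(events):
    """Return events whose type or details hint at a connectivity problem."""
    hits = []
    for ev in events:
        # Flatten the details so nested messages are searched too.
        text = json.dumps(ev.get("details", {})).lower()
        if ev.get("type") in SUSPECT_TYPES or any(k in text for k in KEYWORDS):
            hits.append(ev)
    return hits
```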
Likely Common Root Cause
Both issues in a custom VPC in sa-east-1 point to an incomplete network configuration:
- Serverless Issue → VPC rules don't apply; need Serverless Network Policy for .com.br
- Classic Hang → Missing self-referencing SG rules OR DNS not enabled in VPC
Recommended action order:
- Fix VPC DNS settings first
- Add self-referencing Security Group rules
- Add missing VPC Endpoints
- Add Serverless Network Policy for .com.br egress
- If Serverless still fails, open a Support ticket with the DNS test result above
Hope this helps unblock both issues! Let me know what the cluster Event Log shows; that will help narrow down the Classic Cluster hang further.