Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Serverless Compute connectivity issues with .com.br domains vs. Classic Clusters Spark hangs

ThiagoRosetti
New Contributor

Hi everyone,

I'm facing two specific issues in my Databricks Premium workspace (AWS - sa-east-1).

  1. Serverless Connectivity Issue: When using Serverless compute, I can successfully call APIs ending in .com, but calls to .com.br domains fail with connection/DNS errors. The exact same code works fine when running on a Classic Cluster.

  • VPC Setup: Custom VPC with Unity Catalog enabled.

  • Security Groups: Outbound rules are open for port 443 (0.0.0.0/0).

  • Symptom: It feels like a DNS resolution or Egress filtering issue specific to Serverless.

  2. Classic Cluster Spark Hang: On the other hand, when I switch to a Classic Cluster to bypass the connectivity issue, any Spark command (e.g., spark.read or simple transformations) hangs indefinitely without starting the job.

Has anyone experienced this specific behavior where Serverless ignores certain TLDs or where Spark fails to initialize on Classic Clusters in the same VPC?

Thanks in advance!

(pt-br, translated)

Hi everyone,

I'm facing two separate problems in my Premium workspace (AWS, region sa-east-1):

  1. Serverless connectivity: I can't call APIs ending in .com.br from Serverless compute. If the API is on a .com domain, it works normally. The same code works on a Classic Cluster, which suggests that Serverless handles DNS or network egress differently.

  • I have already checked the Security Groups, and port 443 is open to 0.0.0.0/0.

  2. Spark "loading forever" on the cluster: To work around the problem above, I tried a regular cluster. The API request code works, but any Spark command (such as reading a DataFrame or a simple count) keeps processing indefinitely and never starts the job.

Has anyone run into something similar, or does anyone know of a VPC/Unity Catalog setting that could be causing this conflict between the compute type and domain resolution?

Thanks!

1 REPLY

GaneshI
New Contributor III

Hi there,

Great breakdown of the symptoms — these are actually two distinct issues likely sharing a common root cause in your VPC/network configuration. Let me address both:


Issue 1: Serverless Compute — .com.br DNS Resolution Failure

Root Cause

Serverless compute in Databricks does NOT run inside your custom VPC. It runs in a Databricks-managed network and egresses through Databricks' own infrastructure. This means:

  • Your VPC's outbound Security Group rules (0.0.0.0/0 on port 443) do not apply to Serverless
  • Serverless traffic goes through Databricks-controlled egress, which may have its own DNS resolvers and egress filtering
  • .com.br TLD resolution can fail if the managed DNS used by Serverless doesn't properly resolve country-code TLDs (ccTLDs) or if those domains are not on Databricks' egress allowlist

Fix for Serverless Connectivity

Option 1 — Use Serverless Network Policies (Recommended) Databricks introduced Serverless Network Policies to control egress from Serverless compute. You need to explicitly allow the .com.br destinations:

  • Go to Account Console → Network → Serverless Network Policies
  • Add an egress policy that explicitly allows the target .com.br domains/IPs
  • This is the correct and supported way to control Serverless egress — Security Groups alone won't work

Option 2 — Contact Databricks Support If the .com.br domains are being blocked at the Databricks-managed egress layer (not your VPC), you'll need Support to confirm whether those ccTLDs are filtered and to whitelist them at the platform level for your workspace in sa-east-1.

Option 3 — Verify DNS explicitly In a Serverless notebook, run:

 
 
python
import socket

# Resolve the target host; an exception here points to DNS rather than TCP/TLS.
try:
    print(socket.getaddrinfo("yourtarget.com.br", 443))
except Exception as e:
    print(f"DNS failed: {e}")

This confirms whether it's a DNS resolution failure vs. a TCP/TLS connection block — important distinction for Support.
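To make that distinction concrete, here is a minimal sketch that reports which stage fails first: DNS, TCP, or TLS. The `probe` function, its return format, and the `resolve` parameter are illustrative assumptions and not part of any Databricks API; `yourtarget.com.br` is the same placeholder as in the snippet above.

```python
import socket
import ssl


def probe(host, port=443, timeout=5.0, resolve=socket.getaddrinfo):
    """Return (stage, detail); stage is 'dns', 'tcp', 'tls', or 'ok'."""
    # Stage 1: DNS resolution.
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("dns", str(exc))

    # Stage 2: TCP connection to the first resolved address.
    address = infos[0][4][:2]  # (ip, port); drops the extra IPv6 fields
    try:
        sock = socket.create_connection(address, timeout=timeout)
    except OSError as exc:
        return ("tcp", f"{address[0]}: {exc}")

    # Stage 3: TLS handshake with certificate and hostname verification.
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return ("ok", f"{address[0]} {tls.version()}")
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        sock.close()
        return ("tls", f"{address[0]}: {exc}")


# Example: run once on Serverless and once on a Classic cluster, then compare.
# for host in ("example.com", "yourtarget.com.br"):
#     print(host, probe(host))
```

If the Serverless run reports `dns` while the Classic run reports `ok`, name resolution is the likely suspect. A `tcp` or `tls` result would point more toward egress filtering.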


Issue 2: Classic Cluster — Spark Hanging Indefinitely

Root Cause

Classic clusters do run inside your VPC, so this is almost certainly a VPC networking/configuration problem. A Spark job hanging without starting (not failing — just hanging) typically points to:

| Cause | Explanation |
| --- | --- |
| Driver ↔ Executor communication blocked | Security Groups may block internal cluster traffic on required ports |
| S3 / Metastore connectivity issue | Unity Catalog metastore or S3 access is blocked, causing Spark context init to stall |
| Missing VPC Endpoints | Required AWS endpoints (S3, STS, KMS) may be missing, causing timeouts |
| DNS resolution failure inside VPC | Custom VPC may have DNS hostnames/resolution not enabled |

Fix for Classic Cluster Spark Hang

Step 1 — Check VPC DNS Settings (Most Common Fix)

In AWS Console → Your VPC → Actions:

  • Enable DNS hostnames → must be Yes
  • Enable DNS resolution → must be Yes

If either is disabled, Spark nodes can't resolve each other or AWS service endpoints — causing silent hangs.
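The same two settings can also be read programmatically. The sketch below assumes a boto3 EC2 client is passed in; `check_vpc_dns` is a hypothetical helper name and the VPC ID in the example is a placeholder.

```python
def check_vpc_dns(ec2_client, vpc_id):
    """Return the two DNS attributes of a VPC as booleans."""
    attributes = {
        "enableDnsSupport": "EnableDnsSupport",      # "DNS resolution" in the console
        "enableDnsHostnames": "EnableDnsHostnames",  # "DNS hostnames" in the console
    }
    result = {}
    for attribute, response_key in attributes.items():
        # describe_vpc_attribute accepts one attribute per call.
        response = ec2_client.describe_vpc_attribute(
            VpcId=vpc_id, Attribute=attribute
        )
        result[attribute] = response[response_key]["Value"]
    return result


# Example (requires boto3 and AWS credentials; the VPC ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2", region_name="sa-east-1")
# print(check_vpc_dns(ec2, "vpc-0123456789abcdef0"))
```

Both values should come back as `True` for the VPC that hosts the Classic clusters.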

Step 2 — Verify Security Group Inbound Rules for Internal Traffic

Databricks Classic Clusters require self-referencing inbound rules in the Security Group:

| Type | Protocol | Port Range | Source |
| --- | --- | --- | --- |
| All TCP | TCP | 0–65535 | Same Security Group ID |
| All UDP | UDP | 0–65535 | Same Security Group ID |

Without this, Driver and Executor nodes can't communicate — Spark will silently hang.
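As a rough way to verify the self-referencing rules, the sketch below inspects one entry from the `SecurityGroups` list returned by boto3's `describe_security_groups`. The helper name `allows_all_from_self` and the security group ID are assumptions for illustration. It only recognizes a single rule covering the full 0–65535 range (or an all-protocols rule), so rules split into several port ranges would be reported as missing.

```python
def allows_all_from_self(group, protocol):
    """True if `group` allows all ports of `protocol` inbound from itself."""
    for permission in group.get("IpPermissions", []):
        from_self = any(
            pair.get("GroupId") == group["GroupId"]
            for pair in permission.get("UserIdGroupPairs", [])
        )
        if not from_self:
            continue
        if permission.get("IpProtocol") == "-1":  # all protocols, all ports
            return True
        if (
            permission.get("IpProtocol") == protocol
            and permission.get("FromPort") == 0
            and permission.get("ToPort") == 65535
        ):
            return True
    return False


# Example (requires boto3 and AWS credentials; the group ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2", region_name="sa-east-1")
# group = ec2.describe_security_groups(
#     GroupIds=["sg-0123456789abcdef0"]
# )["SecurityGroups"][0]
# for protocol in ("tcp", "udp"):
#     print(protocol, allows_all_from_self(group, protocol))
```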

Step 3 — Verify Required VPC Endpoints Exist

For Unity Catalog + AWS in a custom VPC, these endpoints are strongly recommended:

  • com.amazonaws.sa-east-1.s3 (Gateway type)
  • com.amazonaws.sa-east-1.sts
  • com.amazonaws.sa-east-1.kinesis-streams (if using streaming)

Missing S3 or STS endpoints in a private subnet can cause Spark to stall during initialization when the subnet has no other route to those services, such as a NAT gateway.
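A quick way to compare the endpoints that exist against the list above is sketched here. `missing_endpoints` is a hypothetical helper; it expects the `VpcEndpoints` list from boto3's `describe_vpc_endpoints`, and the VPC ID in the example is a placeholder.

```python
def missing_endpoints(vpc_endpoints, region, required=("s3", "sts")):
    """Return the short service names from `required` with no available endpoint."""
    present = {
        endpoint["ServiceName"]
        for endpoint in vpc_endpoints
        # Compare case-insensitively; only usable endpoints count.
        if endpoint.get("State", "").lower() == "available"
    }
    return [
        service
        for service in required
        if f"com.amazonaws.{region}.{service}" not in present
    ]


# Example (requires boto3 and AWS credentials; the VPC ID is a placeholder):
# import boto3
# ec2 = boto3.client("ec2", region_name="sa-east-1")
# endpoints = ec2.describe_vpc_endpoints(
#     Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
# )["VpcEndpoints"]
# print(missing_endpoints(endpoints, "sa-east-1"))
```

For streaming workloads, `required=("s3", "sts", "kinesis-streams")` would cover the third endpoint as well.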

Step 4 — Check Cluster Event Logs In the Databricks UI → Cluster → Event Log tab, look for timeout or unreachable host errors that may not surface in the notebook itself.
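The event log can also be pulled through the Clusters REST API, which is handy for attaching to a Support ticket. This is a sketch under assumptions: the `/api/2.1/clusters/events` path and the event type names are from the Clusters API as I understand it and should be checked against the API reference for your workspace. The helper names, workspace URL, token, and cluster ID are placeholders.

```python
import json
import urllib.request

# Event types that commonly accompany a stalled driver (assumed names).
SUSPECT_TYPES = [
    "DRIVER_NOT_RESPONDING",
    "DRIVER_UNAVAILABLE",
    "METASTORE_DOWN",
    "NODES_LOST",
    "SPARK_EXCEPTION",
]


def build_events_request(host, token, cluster_id,
                         event_types=SUSPECT_TYPES, limit=50):
    """Build (but do not send) a POST request for cluster events."""
    body = {
        "cluster_id": cluster_id,
        "event_types": list(event_types),
        "limit": limit,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/clusters/events",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_events(request, timeout=30):
    """Send the request and return the list under the 'events' key."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response).get("events", [])


# Example (all three values are placeholders):
# request = build_events_request(
#     "https://<workspace-url>", "<personal-access-token>", "<cluster-id>"
# )
# for event in fetch_events(request):
#     print(event.get("timestamp"), event.get("type"), event.get("details"))
```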


Likely Common Root Cause

Both issues in a custom VPC in sa-east-1 point to an incomplete network configuration:

 
 
Serverless Issue  → VPC rules don't apply; need Serverless Network Policy for .com.br
Classic Hang      → Missing self-referencing SG rules OR DNS not enabled in VPC

Recommended action order:

  1. Fix VPC DNS settings first
  2. Add self-referencing Security Group rules
  3. Add missing VPC Endpoints
  4. Add Serverless Network Policy for .com.br egress
  5. If Serverless still fails, open a Support ticket with the DNS test result above

Hope this helps unblock both issues! Let me know what the cluster Event Log shows; that will help narrow down the Classic Cluster hang further.