Hi @asim_mirza_12,
The javax.net.ssl.SSLException: Connection reset error you are seeing with Unity Catalog and S3 operations, while basic Spark jobs and curl commands work, typically points to a networking-layer issue: the JVM's SSL handshake is being interrupted before it completes. Here are the areas to investigate.
VERIFY VPC ENDPOINT AND ROUTE TABLE CONFIGURATION
An S3 Gateway VPC endpoint is fully compatible with Unity Catalog and is actually recommended by Databricks. If you removed it, you should add it back. The key requirement is that the route table entries for the S3 gateway endpoint must be associated with the subnets where your Databricks clusters run.
To confirm proper configuration:
1. In the AWS Console, go to VPC > Endpoints and verify your S3 Gateway endpoint exists and is in the "Available" state.
2. Check that the endpoint's route tables include the route tables associated with your Databricks private subnets.
3. The route table should show a prefix list entry (pl-xxxxxxxx) pointing to the S3 gateway endpoint (vpce-xxxxxxxx).
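To double-check the same thing from the AWS CLI side, here is a sketch (the VPC ID and region are placeholders; it assumes credentials that can call the EC2 API):

```shell
# Hypothetical values -- substitute your workspace VPC and region
VPC_ID="vpc-xxxxxxxx"
REGION="us-east-1"

# List S3 gateway endpoints in the VPC, their state, and attached route tables
aws ec2 describe-vpc-endpoints \
  --region "$REGION" \
  --filters "Name=vpc-id,Values=$VPC_ID" \
            "Name=service-name,Values=com.amazonaws.$REGION.s3" \
  --query 'VpcEndpoints[].{Id:VpcEndpointId,State:State,RouteTables:RouteTableIds}' \
  --output table
```

The RouteTables column in the output should include every route table associated with your cluster subnets.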
If you are using regional S3 endpoints, also make sure you set the environment variable in your cluster configuration:
AWS_REGION=<your-aws-region>
Documentation: https://docs.databricks.com/aws/en/security/network/classic/customer-managed-vpc
CHECK REQUIRED OUTBOUND PORTS
Unity Catalog requires connectivity on several ports beyond just HTTPS/443. In particular, make sure your security groups and NACLs allow outbound traffic on these ports:
- Port 443: S3, STS, Databricks control plane
- Port 8443: Databricks control plane REST API
- Port 8444: Unity Catalog logging and lineage data streaming
- Ports 8445-8451: Additional control plane services
If any of these are blocked, Unity Catalog operations (SHOW CATALOGS, schema/table creation) will fail even though basic Spark jobs may still work.
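One way to spot-check these ports from the cluster itself is a bash /dev/tcp probe in a %sh cell. This is a sketch; the hostname below is a placeholder for your own workspace URL's host:

```shell
# Hypothetical hostname -- replace with your workspace host
HOST="dbc-xxxxxxxx.cloud.databricks.com"

for PORT in 443 8443 8444 8445 8446 8447 8448 8449 8450 8451; do
  # Open a TCP connection on file descriptor 3; a 3-second timeout catches
  # silently dropped packets as well as explicit refusals
  if timeout 3 bash -c "exec 3<>/dev/tcp/$HOST/$PORT" 2>/dev/null; then
    echo "port $PORT: open"
  else
    echo "port $PORT: blocked or filtered"
  fi
done
```

Any port that reports blocked needs a matching egress rule in the security group or NACL.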
CHECK FOR STS ENDPOINT CONNECTIVITY
Unity Catalog relies on AWS STS for IAM role assumption when accessing S3 through storage credentials. If STS calls are failing or timing out, you will see SSL connection resets on the S3 side because the credentials never get issued.
Verify that:
1. Outbound access to sts.amazonaws.com:443 is allowed (or you have an STS VPC interface endpoint).
2. If using a VPC interface endpoint for STS, make sure "Enable private DNS name" is checked so that the global STS endpoint resolves to the private IP.
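A quick end-to-end STS check from a %sh cell (the second command assumes the AWS CLI is available on the cluster; if it is not, the DNS lookup alone still tells you whether private DNS is in effect):

```shell
STS_HOST="sts.amazonaws.com"

# With a private-DNS interface endpoint, this should resolve to an IP inside
# your VPC CIDR rather than a public Amazon range
nslookup "$STS_HOST"

# Confirms TLS to STS plus working instance credentials end to end
aws sts get-caller-identity
```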
VERIFY SECURITY GROUP AND NACL RULES
For Databricks on a customer-managed VPC, the security group attached to your cluster subnets needs:
- Egress: All TCP to the workspace security group (self-referencing) and TCP to 0.0.0.0/0 on the required ports listed above.
- Ingress: All TCP and UDP from the workspace security group (self-referencing).
For NACLs, remember that rules are evaluated in ascending rule-number order, so the allow rules for Databricks traffic must have lower rule numbers than any deny rules. If you have added deny rules or custom NACLs, they may be intercepting traffic before the allow rules are evaluated.
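You can dump the effective NACL entries for a cluster subnet with the AWS CLI (a sketch; the subnet ID is a placeholder):

```shell
# Hypothetical subnet -- substitute one of your Databricks cluster subnets
SUBNET_ID="subnet-xxxxxxxx"

# Entries evaluate in ascending RuleNumber order, so allow rules for
# Databricks traffic must sort before any deny rules
aws ec2 describe-network-acls \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'NetworkAcls[].Entries[]' \
  --output table
```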
VALIDATE THE IAM ROLE TRUST POLICY
For Unity Catalog to access S3, the IAM role used as the storage credential must have the correct trust policy allowing the Databricks Unity Catalog service principal to assume it. If this role assumption fails silently, the downstream S3 calls will fail with connection-level errors.
Confirm that:
1. The IAM role's trust policy includes the Databricks external ID for your account.
2. The role has s3:GetObject, s3:PutObject, s3:DeleteObject, s3:ListBucket, and s3:GetBucketLocation permissions on the target bucket.
3. No S3 bucket policy is explicitly denying access from the VPC or VPC endpoint.
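To inspect the trust policy directly, something like the following works from any machine with IAM read access (the role name is hypothetical; use the role behind your Unity Catalog storage credential):

```shell
# Hypothetical role name -- substitute your storage-credential role
ROLE_NAME="unity-catalog-storage-role"

# The trust policy should name the Databricks Unity Catalog principal and
# carry an sts:ExternalId condition matching the external ID for your account
aws iam get-role --role-name "$ROLE_NAME" \
  --query 'Role.AssumeRolePolicyDocument' --output json
```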
CHECK FOR TLS VERSION OR CIPHER MISMATCH
The "Connection reset" error during SSL specifically can occur if there is a network appliance (firewall, proxy, or NAT device) performing TLS inspection that is terminating the connection. If you have any appliance doing deep packet inspection or SSL interception on the outbound path from your Databricks subnets, it may be stripping or modifying the TLS handshake.
To test this: from a notebook on the cluster, run:
%sh
openssl s_client -connect s3.amazonaws.com:443 -tls1_2
If this returns a certificate chain from Amazon and completes the handshake, the raw connectivity is fine and the issue is at the JVM/SDK level. If it fails or shows an unexpected certificate (from a proxy), that is your root cause.
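To make the interception check concrete, you can print the issuer of the certificate the cluster actually receives. An issuer naming your proxy or firewall vendor instead of an Amazon-trusted CA points straight at TLS inspection:

```shell
# Print issuer and subject of the certificate presented to the cluster;
# echo closes the connection immediately after the handshake
echo | openssl s_client -connect s3.amazonaws.com:443 \
    -servername s3.amazonaws.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject
```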
QUICK DIAGNOSTIC CHECKLIST
Run these commands from a notebook on the affected cluster to narrow down the issue:
%sh
# Test S3 connectivity
curl -v https://s3.<your-region>.amazonaws.com 2>&1 | head -30
# Test STS connectivity
curl -v https://sts.amazonaws.com 2>&1 | head -30
# Test Unity Catalog control plane connectivity
curl -v https://<your-workspace-url>:8444 2>&1 | head -30
# Check DNS resolution
nslookup s3.<your-region>.amazonaws.com
If curl succeeds but the JVM-based operations fail, the issue is likely the JVM's trust store, a TLS inspection appliance, or a middlebox that permits the initial TCP handshake but resets the connection during TLS negotiation.
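If everything network-side checks out and you need to see where the JVM itself gives up, you can have it log the handshake. A sketch, assuming the standard Spark configuration keys set under the cluster's Advanced Options:

```
spark.driver.extraJavaOptions -Djavax.net.debug=ssl:handshake
spark.executor.extraJavaOptions -Djavax.net.debug=ssl:handshake
```

The driver logs will then show exactly which handshake step receives the reset, which narrows the fault to the trust store, the cipher negotiation, or the network path.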
ADDITIONAL NOTE ON MARKETPLACE WORKSPACES
AWS Marketplace (QuickLaunch) workspaces use the same underlying infrastructure as standard Databricks deployments. The S3 gateway endpoint, STS interface endpoint, and all Unity Catalog networking requirements apply identically. The QuickLaunch provisioning process creates a default VPC configuration, but if you have customized the networking (added firewalls, changed NACLs, removed endpoints), those changes need to align with the requirements documented here: https://docs.databricks.com/aws/en/security/network/classic/customer-managed-vpc
If you have confirmed all of the above and the issue persists, I would recommend opening a support ticket with the output from the diagnostic commands above. The support team can correlate with control plane logs to pinpoint exactly where the SSL handshake is failing.
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and prior memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the draft when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.