What changed
A production daily job that had run unchanged for ~8 months started failing on 2026-05-18 at ~23:46 UTC. The notebook does a plain spark.read.json("s3://BUCKET/...") against a bucket in af-south-1; the metastore is in eu-west-1. Same code, same cluster spec, same external location. It just stopped working.
Three independent production jobs that all read from the same af-south-1 bucket failed within a ~3-hour window. Other jobs in the same workspace that don't touch this bucket continued to run fine. Nothing was deployed or reconfigured on our side.
The error
shaded.databricks.org.apache.hadoop.fs.s3a.AWSBadRequestException:
getFileStatus on s3://BUCKET/activities/year=2026/month=5/day=18/log.json:
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request;
request: HEAD https://BUCKET.s3.af-south-1.amazonaws.com activities/year%3D2026/month%3D5/day%3D18/log.json
...
(Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; ...)
Note that Error Code: 400 Bad Request is just the HTTP status text echoed back, not a proper S3 error code (InvalidArgument, InvalidRequest, etc.), and the response body is empty. Part of that is inherent to the request type: a HEAD response cannot carry a body, so S3 never returns a detailed error code for a failed HEAD. Even so, a bare 400 is the signature you typically see when an unsigned or wrongly-signed request hits an opt-in region, because af-south-1 enforces SigV4 with the correct signing region.
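The difference between a real S3 error code and an echoed status line can be checked mechanically. The following is a minimal sketch, assuming the exception message carries the "(Service: ...; Status Code: ...; Error Code: ...)" details shown in the trace. classify_s3_error is a hypothetical helper name, not part of any SDK.

```python
import re
from http import HTTPStatus

# Matches the "Status Code: N; Error Code: X" pair seen in the exception text.
_DETAILS = re.compile(r"Status Code:\s*(\d{3});\s*Error Code:\s*([^;)]+)")


def classify_s3_error(message: str) -> dict:
    """Extract status and error code; flag codes that only echo the HTTP status."""
    match = _DETAILS.search(message)
    if match is None:
        raise ValueError("no 'Status Code / Error Code' pair found in message")
    status = int(match.group(1))
    code = match.group(2).strip()
    try:
        phrase = HTTPStatus(status).phrase  # e.g. "Bad Request" for 400
    except ValueError:
        phrase = ""
    # Assumption: when no error body is available, the reported code has the
    # form "<status> <reason phrase>" instead of a named S3 error code.
    echoed = f"{status} {phrase}".strip()
    return {
        "status": status,
        "error_code": code,
        "is_status_echo": code.lower() == echoed.lower(),
    }
```

If is_status_echo is true, the message carries no S3-level diagnosis, and repeating the request as a GET (which can return an error body) is more likely to expose the underlying code.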
Setup
- Metastore: aws:eu-west-1
- Bucket: af-south-1 (opt-in region, account opt-in status is still ENABLED)
- UC external location pointed at the af-south-1 bucket, storage credential is an AWS IAM role
- Test Connection in the External Location UI still passes (green on Read, List, Write, Delete, Path Exists, Assume Role, etc.)
- Job cluster has no instance profile — it relies on UC credential vending for S3 access
- Spark code uses the s3:// scheme (not s3a://)
Reproduces across every compute kind we tried
This is the key data point, because it rules out a single runtime patch as the cause:
Classic job cluster, DBR 17.3 LTS (SINGLE_USER, CLASSIC_PREVIEW)
- Hadoop 3.4.2, aws-sdk-java 1.12.681, Scala 2.13.16
- OS: amzn2023 aarch64
- Result: 400
Classic interactive cluster (older DBR)
- Hadoop 3.3.6, aws-sdk-java 1.12.638, Scala 2.12.15
- OS: Linux 5.15 x86_64
- Result: 400
Serverless compute, Base environment v1
- Hadoop 3.4.2, aws-sdk-java 1.12.681, Scala 2.13.16
- OS: amzn2023 x86_64
- Result: 400
Three compute types, two Hadoop versions, two AWS SDK versions, two Scala versions, two CPU architectures: all fail identically against af-south-1.
What I verified and ruled out
- AWS region opt-in: af-south-1 still ENABLED on the account (aws account get-region-opt-status)
- AWS IAM: zero CloudTrail events touching the storage-credential IAM role in the breakage window
- Bucket config: zero CloudTrail events on the bucket; bucket policy still absent (IAM-only)
- AWS Organizations SCPs: the account isn't a member of any organization
- Databricks audit (system.access.audit): zero write events on unityCatalog, clusters, or iam in the window
- Status page: all green during the incident
- Databricks Runtime maintenance updates for 17.3 LTS published 2026-05-10 to 2026-05-19: only one entry (May 13, Delta sharing client lib bump) — nothing S3/UC related
- Delete and recreate the external location: does not fix it
- UC vending API itself: still returns 200. From system.access.audit, generateTemporaryPathCredential calls on the last-successful-day vs first-failed-day are byte-identical (same credential_id, url, operation, credential_kind). UC hands back STS credentials successfully — those credentials then fail when used against af-south-1.
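The day-over-day comparison of vending calls can be reproduced on rows exported from system.access.audit. This is a minimal sketch, assuming each row's request_params has already been loaded as a Python dict. The compared keys are the four named above; the helper names are hypothetical.

```python
import json

# Fields compared in the post; taken from the audit request parameters named there.
COMPARED_KEYS = ("credential_id", "url", "operation", "credential_kind")


def canonical_params(request_params: dict, keys=COMPARED_KEYS) -> str:
    """Serialize the selected request parameters in a stable, comparable form."""
    subset = {k: request_params.get(k) for k in keys}
    return json.dumps(subset, sort_keys=True, separators=(",", ":"))


def diff_vending_calls(good_day: list, bad_day: list) -> dict:
    """Return the distinct parameter sets that appear on only one of the two days."""
    good = {canonical_params(p) for p in good_day}
    bad = {canonical_params(p) for p in bad_day}
    return {"only_good": sorted(good - bad), "only_bad": sorted(bad - good)}
```

Two empty lists in the result mean the requests were identical on the compared fields, which points the investigation at what the credentials do after they are issued rather than at how they are requested.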
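So the API succeeds, but the credentials it returns no longer work against the cross-region, opt-in bucket. My best guess is that the underlying AssumeRole either started attaching a region-scoping session policy or changed which STS endpoint it calls (by default, session tokens issued by the global STS endpoint are valid only in regions that are enabled by default, which excludes opt-in regions such as af-south-1), but I can't see inside the vending implementation to confirm.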
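One way to separate "signed for the wrong region" from "credentials rejected" is to inspect the SigV4 credential scope of a failing request. This is a minimal sketch, assuming the Authorization header can be captured (for example from SDK request logging, which this post has not done) and that the endpoint is a standard regional S3 host like the one in the trace. All function names are hypothetical.

```python
import re

# SigV4 credential scope: Credential=<key id>/<yyyymmdd>/<region>/<service>/aws4_request
_SCOPE = re.compile(r"Credential=([^/\s]+)/(\d{8})/([^/\s]+)/([^/\s]+)/aws4_request")
# Standard regional S3 hosts only, e.g. BUCKET.s3.af-south-1.amazonaws.com
_HOST_REGION = re.compile(r"(?:^|\.)s3[.-]([a-z0-9-]+)\.amazonaws\.com$")


def signing_scope(authorization: str) -> dict:
    """Parse the credential scope out of a SigV4 Authorization header."""
    match = _SCOPE.search(authorization)
    if match is None:
        raise ValueError("no SigV4 credential scope found")
    key_id, date, region, service = match.groups()
    return {"access_key_id": key_id, "date": date, "region": region, "service": service}


def endpoint_region(host: str):
    """Return the region embedded in a regional S3 host name, or None."""
    match = _HOST_REGION.search(host.lower())
    return match.group(1) if match else None


def scope_matches_endpoint(authorization: str, host: str) -> bool:
    """True when the request was signed for the region it was sent to."""
    return signing_scope(authorization)["region"] == endpoint_region(host)
```

A mismatch would support the wrong-signing-region reading of the 400. A match would leave the session token itself as the more likely culprit.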
Proof it's specifically the region pair
I copied one of the failing source files to a fresh eu-west-1 bucket, registered a new UC external location for it, and ran the same Spark code from the same compute:
- spark.read.json against the af-south-1 path returned 400
- spark.read.json against the eu-west-1 copy succeeded
Same code, same cluster, same UC infrastructure pattern — only the bucket region differs.
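The side-by-side test above can be captured as a small harness so the result is easy to rerun and attach to a support ticket. This is a minimal sketch: compare_reads is a hypothetical helper, and the reader is passed in as a callable so the same function works in a notebook or with a stub.

```python
from typing import Callable, Dict, Iterable


def compare_reads(read: Callable[[str], object], paths: Iterable[str]) -> Dict[str, str]:
    """Run the same read against each path and record 'ok' or a failure summary."""
    results = {}
    for path in paths:
        try:
            read(path)
            results[path] = "ok"
        except Exception as exc:  # deliberately broad: any failure is a data point
            results[path] = f"{type(exc).__name__}: {str(exc)[:200]}"
    return results


# In a notebook, where `spark` is already defined, the call could look like this
# (the second bucket name is a placeholder for the eu-west-1 copy):
#   compare_reads(lambda p: spark.read.json(p).limit(1).collect(),
#                 ["s3://BUCKET/...", "s3://EU_WEST_1_COPY/..."])
```

Passing the reader as an argument keeps the two runs identical except for the path, which is the point of the comparison.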