Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cross-region S3 reads suddenly fail with 400 Bad Request — eu-west-1 metastore to af-south-1 bucket

Bank_Kirati
New Contributor III

What changed

A production daily job that has worked unchanged for ~8 months started failing on 2026-05-18 ~23:46 UTC. The notebook does a plain spark.read.json("s3://BUCKET/...") against a bucket in af-south-1. The metastore is in eu-west-1. Same code, same cluster spec, same external location — just stopped working.

Three independent production jobs that all read from the same af-south-1 bucket failed within a ~3-hour window. Other jobs in the same workspace that don't touch this bucket continued to run fine. Nothing was deployed or reconfigured on our side.

The error

<html><body> shaded.databricks.org.apache.hadoop.fs.s3a.AWSBadRequestException:
getFileStatus on s3://BUCKET/activities/year=2026/month=5/day=18/log.json:
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request;
request: HEAD https://BUCKET.s3.af-south-1.amazonaws.com activities/year%3D2026/month%3D5/day%3D18/log.json
...
(Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; ...) </body></html>

 

Note the Error Code: 400 Bad Request. That is the HTTP status text echoed back, not a proper S3 error code (InvalidArgument, InvalidRequest, etc.). The response body is empty, which is expected for a HEAD request, so the SDK has no error XML to parse a code from. This is the signature you typically see when an unsigned or wrongly-signed request hits an opt-in region, because af-south-1 enforces SigV4 with the correct signing region.
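
A minimal Python sketch of that distinction, assuming the exception text has the "Status Code: ...; Error Code: ..." shape shown in the error above. The helper name parse_s3_error is illustrative and not part of any SDK.

```python
import re

# Matches the "Status Code: NNN; Error Code: XXX" fragment of an S3A exception message.
_PATTERN = re.compile(r"Status Code:\s*(\d{3});\s*Error Code:\s*([^;)]+)")


def parse_s3_error(message: str) -> dict:
    """Extract status and error code, flagging codes that are only HTTP status text."""
    match = _PATTERN.search(message)
    if match is None:
        return {"status": None, "code": None, "status_text_only": False}
    status = int(match.group(1))
    code = match.group(2).strip()
    # With no error body to parse (for example on HEAD), the reported code
    # is just "<status> <reason>" rather than a named S3 error code.
    status_text_only = code.startswith(f"{status} ")
    return {"status": status, "code": code, "status_text_only": status_text_only}


sample = "(Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; ...)"
print(parse_s3_error(sample))
```

A result with status_text_only set to True means the message carries no S3-level diagnosis, so the cause has to be inferred from context (region, signing, token) rather than read from the code.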

Setup

  • Metastore: aws:eu-west-1
  • Bucket: af-south-1 (opt-in region, account opt-in status is still ENABLED)
  • UC external location pointed at the af-south-1 bucket, storage credential is an AWS IAM role
  • Test Connection in the External Location UI still passes (green on Read, List, Write, Delete, Path Exists, Assume Role, etc.)
  • Job cluster has no instance profile — it relies on UC credential vending for S3 access
  • Spark code uses the s3:// scheme (not s3a://)

Reproduces across every compute kind we tried

This is the key data point: it rules out a single runtime patch as the cause.

Classic job cluster, DBR 17.3 LTS (SINGLE_USER, CLASSIC_PREVIEW)

  • Hadoop 3.4.2, aws-sdk-java 1.12.681, Scala 2.13.16
  • OS: amzn2023 aarch64
  • Result: 400

Classic interactive cluster (older DBR)

  • Hadoop 3.3.6, aws-sdk-java 1.12.638, Scala 2.12.15
  • OS: Linux 5.15 x86_64
  • Result: 400

Serverless compute, Base environment v1

  • Hadoop 3.4.2, aws-sdk-java 1.12.681, Scala 2.13.16
  • OS: amzn2023 x86_64
  • Result: 400

Three compute kinds, two Hadoop versions, two Scala versions, two architectures: all fail identically against af-south-1.

What I verified and ruled out

  • AWS region opt-in: af-south-1 still ENABLED on the account (aws account get-region-opt-status)
  • AWS IAM: zero CloudTrail events touching the storage-credential IAM role in the breakage window
  • Bucket config: zero CloudTrail events on the bucket; bucket policy still absent (IAM-only)
  • AWS Organizations SCPs: the account isn't a member of any organization
  • Databricks audit (system.access.audit): zero write events on unityCatalog, clusters, or iam in the window
  • Status page: all green during the incident
  • Databricks Runtime maintenance updates for 17.3 LTS published 2026-05-10 to 2026-05-19: only one entry (May 13, Delta sharing client lib bump) — nothing S3/UC related
  • Delete and recreate the external location: does not fix it
  • UC vending API itself: still returns 200. From system.access.audit, generateTemporaryPathCredential calls on the last-successful-day vs first-failed-day are byte-identical (same credential_id, url, operation, credential_kind). UC hands back STS credentials successfully — those credentials then fail when used against af-south-1.
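
The day-over-day audit comparison in the last bullet can be sketched roughly as follows. This is a sketch under assumptions: the query text, the placeholder dates, and the sample rows are illustrative, and the field names are simply the ones listed in the bullet.

```python
from typing import Dict, List, Set, Tuple

# Request fields compared between the two days (names as listed in the post).
FIELDS = ("credential_id", "url", "operation", "credential_kind")


def audit_query(good_day: str, bad_day: str) -> str:
    """Build an (assumed) query for vending calls on two days, given as YYYY-MM-DD."""
    return (
        "SELECT event_date, request_params, response.status_code AS status_code\n"
        "FROM system.access.audit\n"
        "WHERE service_name = 'unityCatalog'\n"
        "  AND action_name = 'generateTemporaryPathCredential'\n"
        f"  AND event_date IN (DATE'{good_day}', DATE'{bad_day}')"
    )


def request_shapes(rows: List[Dict[str, str]]) -> Set[Tuple]:
    """Reduce request_params dicts to the distinct combinations of the compared fields."""
    return {tuple(row.get(field) for field in FIELDS) for row in rows}


def diff_days(good: List[Dict[str, str]], bad: List[Dict[str, str]]) -> dict:
    """Return request shapes seen on only one of the two days."""
    g, b = request_shapes(good), request_shapes(bad)
    return {"only_good": g - b, "only_bad": b - g}


# Made-up rows standing in for request_params from each day.
good_rows = [{"credential_id": "cred-1", "url": "s3://BUCKET/activities",
              "operation": "PATH_READ", "credential_kind": "AWS_IAM_ROLE"}]
bad_rows = [dict(good_rows[0])]
print(diff_days(good_rows, bad_rows))

# In a notebook, where `spark` is already defined, the rows could come from:
# rows = [r["request_params"] for r in spark.sql(audit_query(GOOD_DAY, BAD_DAY)).collect()]
```

Two empty sets mean the requests to the vending API look the same on both days, which matches the observation that the change is not visible in the request parameters.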

So the API call succeeds, but the credentials it returns no longer work cross-region. My best guess is that the underlying AssumeRole either started attaching a region-scoping session policy or changed which STS endpoint it uses (global versus regional), but I can't see inside the vending implementation to confirm.
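
One account-side setting bears on the STS-endpoint part of that guess. Per the AWS documentation, session tokens issued by the global STS endpoint are, by default, valid only in Regions that are enabled by default, while tokens from regional STS endpoints are valid in all Regions. The IAM account summary reports which token version the global endpoint issues. The sketch below assumes boto3 and IAM read permissions are available; the helper names are hypothetical, and it does not establish which account's setting governs a cross-account AssumeRole made by the vending service.

```python
def global_tokens_valid_in_opt_in_regions(summary_map: dict) -> bool:
    """True when the global STS endpoint is set to issue version 2 tokens."""
    return summary_map.get("GlobalEndpointTokenVersion") == 2


def fetch_summary_map() -> dict:
    """Read the IAM account summary (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helper above works without boto3

    return boto3.client("iam").get_account_summary()["SummaryMap"]


# Offline illustration with a hand-written summary map.
print(global_tokens_valid_in_opt_in_regions({"GlobalEndpointTokenVersion": 1}))

# With credentials for the account that owns the storage-credential role:
# print(global_tokens_valid_in_opt_in_regions(fetch_summary_map()))
```

If the setting is at version 1, a caller that moved to the global STS endpoint would start receiving tokens that opt-in Regions reject. That is only a candidate explanation to raise with support, not a confirmed cause.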

Proof it's specifically the region pair

I copied one of the failing source files to a fresh eu-west-1 bucket, registered a new UC external location for it, and ran the same Spark code from the same compute:

  • spark.read.json against the af-south-1 path returned 400
  • spark.read.json against the eu-west-1 copy succeeded

Same code, same cluster, same UC infrastructure pattern — only the bucket region differs.
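
A rough sketch of that A/B test as a reusable probe. The reader is passed in as a callable so the logic can be exercised without Spark; the eu-west-1 bucket name and the stub reader are placeholders, not the real paths.

```python
from typing import Callable, Dict, Iterable


def probe_paths(read: Callable[[str], object], paths: Iterable[str]) -> Dict[str, str]:
    """Try to read each path and record 'ok' or a short failure summary."""
    results = {}
    for path in paths:
        try:
            read(path)
            results[path] = "ok"
        except Exception as exc:  # broad on purpose: the goal is to record any failure
            results[path] = f"failed: {type(exc).__name__}: {str(exc)[:200]}"
    return results


def stub_reader(path: str) -> None:
    """Stand-in for spark.read.json that fails only for the af-south-1 placeholder."""
    if path.startswith("s3://BUCKET/"):
        raise RuntimeError("Status Code: 400; Error Code: 400 Bad Request")


paths = [
    "s3://BUCKET/activities/year=2026/month=5/day=18/log.json",
    "s3://EU-WEST-1-COPY/activities/log.json",  # hypothetical copy location
]
for path, outcome in probe_paths(stub_reader, paths).items():
    print(path, "->", outcome)

# In a notebook, where `spark` is already defined:
# probe_paths(lambda p: spark.read.json(p).limit(1).collect(), paths)
```

Running the same probe from each compute kind gives a small, comparable record to attach to a support ticket.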

1 REPLY

sameer_yasser
New Contributor

Your debugging is really thorough and you've already done the hard work of isolating this. A 400 with an empty body (no proper S3 error code like InvalidArgument) on an opt-in region most often points to a SigV4 signing region mismatch. af-south-1 strictly enforces that the request is signed with af-south-1 as the signing region. If the SDK signs it with eu-west-1 or falls back to a global endpoint, S3 rejects it with exactly this symptom.
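
To make the signing-region point concrete, here is a small sketch of the standard SigV4 signing-key derivation using only the Python standard library. The secret is the well-known AWS documentation example key and the date is arbitrary. The sketch only shows that the region is baked into both the credential scope and the key; it is not a full request signer.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    """HMAC-SHA256 of msg under key, returned as raw bytes."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str = "s3") -> bytes:
    """Derive the SigV4 signing key: date, then region, then service, then 'aws4_request'."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")


def credential_scope(date: str, region: str, service: str = "s3") -> str:
    """Scope string that appears in the Authorization header's Credential field."""
    return f"{date}/{region}/{service}/aws4_request"


secret = "wJalrXUtnFEMI/K7MDENG+bPxRfiCYEXAMPLEKEY"  # AWS docs example key, not a real secret
date = "20260518"

key_af = sigv4_signing_key(secret, date, "af-south-1")
key_eu = sigv4_signing_key(secret, date, "eu-west-1")

print(credential_scope(date, "af-south-1"))
print(credential_scope(date, "eu-west-1"))
print("keys differ:", key_af != key_eu)
```

Because the region feeds into the key itself, a request signed for eu-west-1 cannot validate at an af-south-1 endpoint even when the credentials are otherwise fine.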

Your observation that UC vending still returns credentials successfully but those credentials then fail is the giveaway. Something on the Databricks side may have recently changed how those STS credentials are scoped or which STS endpoint generates them. By default, session tokens from the global STS endpoint (sts.amazonaws.com) are valid only in Regions that are enabled by default, whereas tokens from regional STS endpoints are valid in all Regions, so a change of endpoint would bite on an opt-in region like af-south-1.

Worth trying as a workaround — add these Spark configs at the cluster level (or session level to test first):

fs.s3a.bucket.<YOUR-BUCKET-NAME>.endpoint s3.af-south-1.amazonaws.com
fs.s3a.bucket.<YOUR-BUCKET-NAME>.endpoint.region af-south-1

This forces the S3A client to sign requests for that specific bucket using the correct region regardless of where the metastore sits. Bucket-scoped configs won't affect your other jobs.
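
For a session-level test, the two settings could be generated and applied along these lines. This is a sketch, not a confirmed fix: the helper name is hypothetical, BUCKET is the placeholder from the original post, and it is an open question whether reads that go through UC credential vending honor per-bucket S3A settings, or whether every compute type allows setting them.

```python
def bucket_region_conf(bucket: str, region: str) -> dict:
    """Build the per-bucket S3A endpoint and signing-region settings."""
    prefix = f"fs.s3a.bucket.{bucket}"
    return {
        f"{prefix}.endpoint": f"s3.{region}.amazonaws.com",
        f"{prefix}.endpoint.region": region,
    }


conf = bucket_region_conf("BUCKET", "af-south-1")
for key, value in conf.items():
    print(key, value)

# In a notebook, where `spark` is already defined, apply them before the read:
# for key, value in conf.items():
#     spark.conf.set(key, value)
# spark.read.json("s3://BUCKET/...")
```

If the read still fails with the settings applied, that result is worth including in the support ticket, since it suggests the problem sits in the vended credentials rather than in the client's choice of signing region.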

That said — given this started on a specific date with no changes on your end and affects UC credential vending internally, you should also open a Databricks Support ticket and reference the date (2026-05-18 ~23:46 UTC). Your audit trail evidence is clean and this has all the hallmarks of a platform-side change to their STS/credential vending layer.

Let us know if the Spark config workaround helps in the meantime.