As of this morning, a Databricks job with a single PySpark notebook task started failing with the error below. The job has had no code changes in two months, the cluster configuration has not changed, and the last successful run was the previous night:
"Py4JJavaError: An error occurred while calling o1830.count. : org.apache.spark.SparkException: Job aborted due to stage failure: Task creation failed: com.databricks.unity.error.MissingCredentialScopeException: [UNITY_CREDENTIAL_SCOPE_MISSING_SCOPE] Missing Credential Scope. Unity Credential Scope id not found in thread locals.. SQLSTATE: XXKUC com.databricks.unity.error.MissingCredentialScopeException: [UNITY_CREDENTIAL_SCOPE_MISSING_SCOPE] Missing Credential Scope. Unity Credential Scope id not found in thread locals.. SQLSTATE: XXKUC"
At a high level, the job iterates over a list of string UUIDs. For each UUID it reads a Delta table stored in Unity Catalog, filters that table to rows matching the UUID, checkpoints the DataFrame roughly halfway through the transformations, and finally writes the result to AWS S3 before moving on to the next UUID. Notably, the first iteration of the loop completes without issue and the files are written to S3 successfully. The failure happens consistently on the second iteration, where we hit the error above (reproducible every time). If I instead kick off a separate job for each UUID in the list, every job succeeds, but that is a sub-optimal workaround because of cluster startup time. To reiterate, this worked fine yesterday and none of the code has changed.
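For reference, here is a minimal sketch of the loop structure. The table name, filter column, checkpoint directory, and S3 paths are placeholders, not the real ones, but the shape of the code is the same (read, filter, checkpoint mid-way, write, repeat):

```python
from pyspark.sql import functions as F

# "spark" is the SparkSession provided by the Databricks notebook.
# Placeholder values -- the real table, column, and paths differ.
uuids = ["uuid-1", "uuid-2", "uuid-3"]
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")

for uuid in uuids:
    # Read the Unity Catalog Delta table and filter to the current UUID
    df = spark.table("catalog.schema.events").filter(F.col("entity_id") == uuid)

    # ... first half of the transformations ...

    # Checkpoint partway through to truncate lineage
    df = df.checkpoint()

    # ... remaining transformations ...

    # An action such as count() (the o1830.count in the stack trace) is
    # where the MissingCredentialScopeException surfaces on iteration two
    row_count = df.count()

    # Write this UUID's output to S3, then move on to the next one
    df.write.mode("overwrite").parquet(f"s3://my-bucket/output/{uuid}/")
```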
It appears this may be a bug related to a change in the Unity Catalog service. Has anyone seen this or have any ideas?