Louis_Frolio
Databricks Employee

Troubleshooting and Resolution for java.io.IOException: Invalid PKCS8 data

The error java.io.IOException: Invalid PKCS8 data typically occurs when there is an issue with the private key format or its storage in Databricks secrets. Based on the provided cluster Spark configurations and the referenced document, here are the potential causes and their resolutions:

Step 1: Validate the Private Key
- Ensure the private key stored in the secret is in the correct PKCS8 format.
- Copying or storing the key can introduce extra spaces, newlines, or other formatting changes. Verify that the key exactly matches the `private_key` value in the JSON key file downloaded from Google Cloud Platform (GCP).
- Example of a correct private key format in the JSON file:
```json
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQI...\n-----END PRIVATE KEY-----\n"
```

Step 2: Check Databricks Secret Configuration
- Ensure the private key and private key ID are properly stored in the Databricks secret.
- Verify the secrets by running the following in a Databricks notebook:
```python
dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
dbutils.secrets.get(scope="newscope", key="gsaprivatekeyid")
```
- Both calls should return the stored values without errors and without extra whitespace.
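
Notebook output replaces secret values with `[REDACTED]`, so printing the result of `dbutils.secrets.get` will not show whether whitespace crept in. One way around this is to print properties of the value rather than the value itself. The sketch below assumes a hypothetical helper named `describe_secret`; it is plain Python, and `dbutils` appears only in the commented usage lines because it exists only inside the Databricks runtime.

```python
def describe_secret(value: str) -> dict:
    """Summarize a secret's shape without revealing its contents."""
    return {
        "length": len(value),
        "leading_or_trailing_whitespace": value != value.strip(),
        "real_newlines": value.count("\n"),
        "literal_backslash_n": value.count("\\n"),
        "has_pem_header": "-----BEGIN PRIVATE KEY-----" in value,
        "has_pem_footer": "-----END PRIVATE KEY-----" in value,
    }


# In a Databricks notebook (dbutils is provided by the runtime):
# key = dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
# print(describe_secret(key))
```

Comparing this summary with the same summary of the `private_key` value from the JSON key file is a quick way to spot a mismatch in length or newline handling.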

Step 3: Confirm Spark Configuration
- Double-check that the cluster Spark configuration matches the setup described in the document:
  - *Service Account Email:* Ensure this matches the email value from your GCP service account JSON.
  - *Project ID:* Verify the project ID is correct and matches your GCP project.

Here is the corrected Spark configuration example:
```properties
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <service-account-email>
spark.hadoop.fs.gs.project.id <project-id>
spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/newscope/gsaprivatekeynew}}
spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/newscope/gsaprivatekeyid}}
```

Step 4: Test with Minimal Configuration
- Create a basic test with only the required Spark configuration to isolate the issue:
```python
df = spark.read.format("csv").option("header", "true").load("gs://<bucket-name>/<path>")
df.show()
```
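
When running the minimal test, it helps to distinguish the key-format failure from unrelated ones such as a wrong bucket path or missing permissions. The sketch below wraps any read in a small classifier; `is_pkcs8_error` and `try_read` are hypothetical names, and the approach assumes the Java exception message is included in the text of the Python exception, which is typical for PySpark but not guaranteed.

```python
def is_pkcs8_error(exc: BaseException) -> bool:
    """Return True if the exception text mentions the PKCS8 parsing failure."""
    return "Invalid PKCS8 data" in str(exc)


def try_read(read_fn) -> str:
    """Run a zero-argument read function and classify any failure."""
    try:
        read_fn()
        return "ok"
    except Exception as exc:
        if is_pkcs8_error(exc):
            return "key-format problem"
        return f"other failure: {type(exc).__name__}"


# In a Databricks notebook (spark is provided by the runtime):
# print(try_read(lambda: spark.read.format("csv")
#                .option("header", "true")
#                .load("gs://<bucket-name>/<path>")
#                .show()))
```

A result of `key-format problem` points back to Steps 1 and 2; any other failure suggests the key was parsed and the issue lies elsewhere in the configuration.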