
Not Able To Access GCP storage bucket from Databricks

ShivangiB
New Contributor III

While running:

```
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('path')

df.show()
```

Getting error: `java.io.IOException: Invalid PKCS8 data`

Cluster Spark config:

```properties
spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/newscope/gsaprivatekeyid}}
spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/newscope/gsaprivatekeynew}}
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.project.id <projectid>
spark.hadoop.fs.gs.auth.service.account.email <email id>
```

I have followed this document: Connect to Google Cloud Storage - Azure Databricks | Microsoft Learn

Please help.

8 REPLIES

BigRoux
Databricks Employee

Troubleshooting and Resolution for java.io.IOException: Invalid PKCS8 data

The error `java.io.IOException: Invalid PKCS8 data` typically occurs when there is an issue with the private key format or how it is stored in Databricks secrets. Based on the provided cluster Spark configuration and the referenced document, here are the potential causes and their resolutions.

Step 1: Validate the Private Key
- Ensure the private key stored in the secret is in the correct PKCS8 format.
- Copying or storing the key can introduce extra spaces, newlines, or formatting issues. Verify that the key matches the exact format listed in the JSON key file downloaded from Google Cloud Platform (GCP); a validation sketch follows the example below.
- Example of a correct private key format in the JSON file:

```json
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQI...\n-----END PRIVATE KEY-----\n"
```
Step 2: Check Databricks Secret Configuration
- Ensure the private key and private key ID are properly stored in the Databricks secret scope.
- Verify the secrets by running the following in a Databricks notebook:

```
dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
dbutils.secrets.get(scope="newscope", key="gsaprivatekeyid")
```

- The secrets should retrieve the stored values without additional whitespace or errors; see the sketch below for a way to check this without printing the key.
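Note that Databricks redacts secret values in notebook output, so rather than printing the key itself you can sanity-check derived values. A quick sketch:

```python
key = dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
print(len(key))  # compare against the length of the value in the JSON key file
print(key.startswith("-----BEGIN PRIVATE KEY-----"))  # True if you stored the full PEM, headers included
print("\\n" in key)  # True means literal backslash-n text was stored instead of real newlines
```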

Step 3: Confirm Spark Configuration
- Double-check that the cluster Spark configuration matches the setup described in the document:
  - *Service Account Email:* Ensure this matches the email value from your GCP service account JSON.
  - *Project ID:* Verify the project ID is correct and matches your GCP project.

Here is the corrected Spark configuration example:
```properties
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <service-account-email>
spark.hadoop.fs.gs.project.id <project-id>
spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/newscope/gsaprivatekeynew}}
spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/newscope/gsaprivatekeyid}}
```
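To confirm the cluster actually picked these up, you can read the resolved Hadoop configuration from a notebook (Spark copies `spark.hadoop.*` entries into the Hadoop configuration with the prefix stripped). A sketch; note that `spark._jsc` is internal PySpark API, so treat this as a debugging aid only:

```python
hconf = spark._jsc.hadoopConfiguration()
print(hconf.get("fs.gs.auth.service.account.email"))  # should match your service account email
print(hconf.get("fs.gs.project.id"))                  # should match your GCP project id
print(hconf.get("google.cloud.auth.service.account.enable"))  # should print "true"
# Avoid printing the private key itself; it is a secret.
```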
Step 4: Test with Minimal Configuration
- Create a basic test with only the required Spark configuration to isolate the issue:

```
df = spark.read.format("csv").option("header", "true").load("gs://<bucket-name>/<path>")
df.show()
```
 

ShivangiB
New Contributor III

@BigRoux can you please suggest?

What value should I store in the private key secret: just the part between BEGIN and END? I am saving only that and still getting the error.


BigRoux
Databricks Employee

What is the error you are getting? More context is needed here.

ShivangiB
New Contributor III

Same error: `java.io.IOException: Invalid PKCS8 data`

```json
"private_key": "-----BEGIN PRIVATE KEY-----\n --have stored this value present between these two--\n-----END PRIVATE KEY-----\n",
```

BigRoux
Databricks Employee

Here is an example of a properly formatted and delimited PKCS#8 private key in PEM format. This format includes the required headers and footers:

```
-----BEGIN PRIVATE KEY-----
MIIBVgIBADANBgkqhkiG9w0BAQEFAASCAUAwggE8AgEAAkEAq7BFUpkGp3+LQmlQ
Yx2eqzDV+xeG8kx/sQFV18S5JhzGeIJNA72wSeukEPojtqUyX2J0CciPBh7eqclQ
2zpAswIDAQABAkAgisq4+zRdrzkwH1ITV1vpytnkO/NiHcnePQiOW0VUybPyHoGM
/jf75C5xET7ZQpBe5kx5VHsPZj0CBb3b+wSRAiEA2mPWCBytosIU/ODRfq6EiV04
lt6waE7I2uSPqIC20LcCIQDJQYIHQII+3YaPqyhGgqMexuuuGx+lDKD6/Fu/JwPb
5QIhAKthiYcYKlL9h8bjDsQhZDUACPasjzdsDEdq8inDyLOFAiEAmCr/tZwA3qeA
ZoBzI10DGPIuoKXBd3nk/eBxPkaxlEECIQCNymjsoI7GldtujVnr1qT+3yedLfHK
srDVjIT3LsvTqw==
-----END PRIVATE KEY-----
```

Explanation:
- Headers and Footers: The key begins with `-----BEGIN PRIVATE KEY-----` and ends with `-----END PRIVATE KEY-----`. These delimiters are mandatory in PEM format.
- Base64 Encoding: The content between the headers is the Base64-encoded representation of the private key data.
- Line Breaks: The encoded data is split into lines of 64 characters for readability, though this is not strictly required by all tools.

This format is widely used for storing private keys in PKCS#8 syntax, which supports various cryptographic algorithms.
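For reference, the parsing steps the linked article below walks through in Java look roughly like this in Python: strip the BEGIN/END delimiters, Base64-decode the body, and load the resulting DER bytes. A sketch assuming the `cryptography` package is installed; `key.pem` is a hypothetical file path:

```python
import base64
from cryptography.hazmat.primitives.serialization import load_der_private_key

# "key.pem" is a hypothetical path to a PEM-encoded PKCS#8 private key.
with open("key.pem") as f:
    pem = f.read()

# Drop the -----BEGIN/END PRIVATE KEY----- delimiter lines.
body = "".join(line for line in pem.splitlines() if "-----" not in line)

# The Base64 body decodes to the raw PKCS#8 DER bytes.
der = base64.b64decode(body)
key = load_der_private_key(der, password=None)
print(type(key).__name__)
```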

 

Further, if you are still encountering problems, I would suggest using Databricks secret scopes. That way you don't have to expose the key directly in the Spark config, which is a security anti-pattern.
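If you load the key into the scope from the command line, quoting matters: the stored value must keep its real newlines. A sketch using the legacy Databricks CLI syntax (newer CLI versions use `databricks secrets put-secret` instead); `key.pem` is a hypothetical file holding the full PEM, headers included:

```
databricks secrets create-scope --scope newscope
databricks secrets put --scope newscope --key gsaprivatekeynew --string-value "$(cat key.pem)"
```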

 

Cheers, Louis.

 

Let me show you how to load a Private Key stored in the PEM PKCS#8 file in Java. We will read the file, parse it, remove not needed header and footer and create a new java.security.PrivateKey object that can be later used in our App for cryptography needs. PEM PKCS#8 is a format for storing ...

ShivangiB
New Contributor III

@BigRoux after updating the key, we are getting a different error:

```
java.io.IOException: Error accessing gs://gcp-storage/FlatFiles/test_data.csv
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-5419098352410353>, line 4
      1 df = spark.read.format("csv") \
      2     .option("header", "true") \
      3     .option("inferSchema", "true") \
----> 4     .load('gs://gcp-storage/FlatFiles/test_data.csv')
      6 df.show()

Py4JJavaError: An error occurred while calling o407.load.
: java.io.IOException: Error accessing gs://gcp-storage/FlatFiles/test_data.csv
at shaded.databricks.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2140)
Caused by: shaded.databricks.com.google.api.client.auth.oauth2.TokenResponseException: 400 Bad Request
POST https://oauth2.googleapis.com/token
{
  "error" : "invalid_grant",
  "error_description" : "Invalid grant: account not found"
}
at shaded.databricks.com.google.api.client.auth.oauth2.TokenResponseException.from(TokenResponseException.java:103)
at shaded.databricks.com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:308)
at shaded.databricks.com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:324)
at shaded.databricks.com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:170)
at shaded.databricks.com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:470)
at shaded.databricks.com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:201)
at shaded.databricks.com.google.cloud.hadoop.util.ChainingHttpRequestInitializer$2.intercept(ChainingHttpRequestInitializer.java:98)
at shaded.databricks.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:880)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at shaded.databricks.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2134)
```

BigRoux
Databricks Employee

At this point it is out of my area of knowledge and I don't have any further suggestions. You may want to consider contacting Databricks Support if you have a support contract.
