
Not Able To Access GCP storage bucket from Databricks

ShivangiB
New Contributor III

While running:

```
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('path')

df.show()
```

Getting error: `java.io.IOException: Invalid PKCS8 data`

Cluster Spark config:

```properties
spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/newscope/gsaprivatekeyid}}
spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/newscope/gsaprivatekeynew}}
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.project.id <projectid>
spark.hadoop.fs.gs.auth.service.account.email <email id>
```

I have followed this document: Connect to Google Cloud Storage - Azure Databricks | Microsoft Learn

Please help.

8 REPLIES

BigRoux
Databricks Employee

Troubleshooting and Resolution for java.io.IOException: Invalid PKCS8 data

The error `java.io.IOException: Invalid PKCS8 data` typically occurs when there is an issue with the private key format or how it is stored in Databricks secrets. Based on the provided cluster Spark configuration and the referenced document, here are the potential causes and their resolutions.

Step 1: Validate the Private Key
- Ensure the private key stored in the secret is in the correct PKCS8 format.
- Copying or storing the key can introduce extra spaces, newlines, or formatting issues. Verify that the key matches the exact format listed in the JSON key file downloaded from Google Cloud Platform (GCP); a validation sketch follows the example below.
- Example of a correct private key format in the JSON file:

```json
"private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQI...\n-----END PRIVATE KEY-----\n"
```
Step 2: Check Databricks Secret Configuration
- Ensure the private key and private key ID are properly stored in the Databricks secret scope.
- Verify the secrets by running the following in a Databricks notebook:

```
dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
dbutils.secrets.get(scope="newscope", key="gsaprivatekeyid")
```

- The secrets should retrieve the stored values without additional whitespace or errors; see the sketch below for a way to check this without printing the key.
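Note that Databricks redacts secret values in notebook output, so rather than printing the key itself you can sanity-check derived values. A quick sketch:

```python
key = dbutils.secrets.get(scope="newscope", key="gsaprivatekeynew")
print(len(key))  # compare against the length of the value in the JSON key file
print(key.startswith("-----BEGIN PRIVATE KEY-----"))  # True if you stored the full PEM, headers included
print("\\n" in key)  # True means literal backslash-n text was stored instead of real newlines
```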

Step 3: Confirm Spark Configuration
- Double-check that the cluster Spark configuration matches the setup described in the document:
  - *Service Account Email:* Ensure this matches the email value from your GCP service account JSON.
  - *Project ID:* Verify the project ID is correct and matches your GCP project.

Here is the corrected Spark configuration example:
```properties
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <service-account-email>
spark.hadoop.fs.gs.project.id <project-id>
spark.hadoop.fs.gs.auth.service.account.private.key {{secrets/newscope/gsaprivatekeynew}}
spark.hadoop.fs.gs.auth.service.account.private.key.id {{secrets/newscope/gsaprivatekeyid}}
```
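To confirm the cluster actually picked these up, you can read the resolved Hadoop configuration from a notebook (Spark copies `spark.hadoop.*` entries into the Hadoop configuration with the prefix stripped). A sketch; note that `spark._jsc` is internal PySpark API, so treat this as a debugging aid only:

```python
hconf = spark._jsc.hadoopConfiguration()
print(hconf.get("fs.gs.auth.service.account.email"))  # should match your service account email
print(hconf.get("fs.gs.project.id"))                  # should match your GCP project id
print(hconf.get("google.cloud.auth.service.account.enable"))  # should print "true"
# Avoid printing the private key itself; it is a secret.
```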
Step 4: Test with Minimal Configuration
- Create a basic test with only the required Spark configuration to isolate the issue:

```
df = spark.read.format("csv").option("header", "true").load("gs://<bucket-name>/<path>")
df.show()
```
 

ShivangiB
New Contributor III

@BigRoux can you please suggest?

What value should I store in the private key secret: just the part between BEGIN and END? I am saving only that and still getting the error.


BigRoux
Databricks Employee

What is the error you are getting? More context is needed here.

ShivangiB
New Contributor III

Same error: `java.io.IOException: Invalid PKCS8 data`

```json
"private_key": "-----BEGIN PRIVATE KEY-----\n --have stored this value present between these two--\n-----END PRIVATE KEY-----\n",
```

BigRoux
Databricks Employee

Here is an example of a properly formatted and delimited PKCS#8 private key in PEM format. This format includes the required headers and footers:

```
-----BEGIN PRIVATE KEY-----
MIIBVgIBADANBgkqhkiG9w0BAQEFAASCAUAwggE8AgEAAkEAq7BFUpkGp3+LQmlQ
Yx2eqzDV+xeG8kx/sQFV18S5JhzGeIJNA72wSeukEPojtqUyX2J0CciPBh7eqclQ
2zpAswIDAQABAkAgisq4+zRdrzkwH1ITV1vpytnkO/NiHcnePQiOW0VUybPyHoGM
/jf75C5xET7ZQpBe5kx5VHsPZj0CBb3b+wSRAiEA2mPWCBytosIU/ODRfq6EiV04
lt6waE7I2uSPqIC20LcCIQDJQYIHQII+3YaPqyhGgqMexuuuGx+lDKD6/Fu/JwPb
5QIhAKthiYcYKlL9h8bjDsQhZDUACPasjzdsDEdq8inDyLOFAiEAmCr/tZwA3qeA
ZoBzI10DGPIuoKXBd3nk/eBxPkaxlEECIQCNymjsoI7GldtujVnr1qT+3yedLfHK
srDVjIT3LsvTqw==
-----END PRIVATE KEY-----
```

Explanation:
- Headers and Footers: The key begins with `-----BEGIN PRIVATE KEY-----` and ends with `-----END PRIVATE KEY-----`. These delimiters are mandatory in PEM format.
- Base64 Encoding: The content between the headers is the Base64-encoded representation of the private key data.
- Line Breaks: The encoded data is split into lines of 64 characters for readability, though this is not strictly required by all tools.

This format is widely used for storing private keys in PKCS#8 syntax, which supports various cryptographic algorithms.
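For reference, the parsing steps the linked article below walks through in Java look roughly like this in Python: strip the BEGIN/END delimiters, Base64-decode the body, and load the resulting DER bytes. A sketch assuming the `cryptography` package is installed; `key.pem` is a hypothetical file path:

```python
import base64
from cryptography.hazmat.primitives.serialization import load_der_private_key

# "key.pem" is a hypothetical path to a PEM-encoded PKCS#8 private key.
with open("key.pem") as f:
    pem = f.read()

# Drop the -----BEGIN/END PRIVATE KEY----- delimiter lines.
body = "".join(line for line in pem.splitlines() if "-----" not in line)

# The Base64 body decodes to the raw PKCS#8 DER bytes.
der = base64.b64decode(body)
key = load_der_private_key(der, password=None)
print(type(key).__name__)
```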

 

Further, if you are still encountering problems, I would suggest using Databricks secret scopes. That way you don't have to expose the key directly in the Spark config, which is a security anti-pattern.
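If you load the key into the scope from the command line, quoting matters: the stored value must keep its real newlines. A sketch using the legacy Databricks CLI syntax (newer CLI versions use `databricks secrets put-secret` instead); `key.pem` is a hypothetical file holding the full PEM, headers included:

```
databricks secrets create-scope --scope newscope
databricks secrets put --scope newscope --key gsaprivatekeynew --string-value "$(cat key.pem)"
```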

 

Cheers, Louis.

 

Let me show you how to load a Private Key stored in the PEM PKCS#8 file in Java. We will read the file, parse it, remove not needed header and footer and create a new java.security.PrivateKey object that can be later used in our App for cryptography needs. PEM PKCS#8 is a format for storing ...

ShivangiB
New Contributor III

@BigRoux after updating the key, we are getting a different error:

```
java.io.IOException: Error accessing gs://gcp-storage/FlatFiles/test_data.csv
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File <command-5419098352410353>, line 4
      1 df = spark.read.format("csv") \
      2     .option("header", "true") \
      3     .option("inferSchema", "true") \
----> 4     .load('gs://gcp-storage/FlatFiles/test_data.csv')
      6 df.show()

Py4JJavaError: An error occurred while calling o407.load.
: java.io.IOException: Error accessing gs://gcp-storage/FlatFiles/test_data.csv
at shaded.databricks.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2140)
Caused by: shaded.databricks.com.google.api.client.auth.oauth2.TokenResponseException: 400 Bad Request
POST https://oauth2.googleapis.com/token
{
  "error" : "invalid_grant",
  "error_description" : "Invalid grant: account not found"
}
at shaded.databricks.com.google.api.client.auth.oauth2.TokenResponseException.from(TokenResponseException.java:103)
at shaded.databricks.com.google.api.client.auth.oauth2.TokenRequest.executeUnparsed(TokenRequest.java:308)
at shaded.databricks.com.google.api.client.auth.oauth2.TokenRequest.execute(TokenRequest.java:324)
at shaded.databricks.com.google.cloud.hadoop.util.CredentialFactory$GoogleCredentialWithRetry.executeRefreshToken(CredentialFactory.java:170)
at shaded.databricks.com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:470)
at shaded.databricks.com.google.api.client.auth.oauth2.Credential.intercept(Credential.java:201)
at shaded.databricks.com.google.cloud.hadoop.util.ChainingHttpRequestInitializer$2.intercept(ChainingHttpRequestInitializer.java:98)
at shaded.databricks.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:880)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at shaded.databricks.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at shaded.databricks.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:2134)
```

BigRoux
Databricks Employee

At this point it is out of my area of knowledge and I don't have any further suggestions. You may want to consider contacting Databricks Support if you have a support contract.
