Databricks Community

huytran · ‎12-01-2024

I am following this guide on writing data to the BigQuery table.

Right now, I have an error when I try to write data using Databricks Secret instead of the JSON credential file and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.

java.io.IOException: Error getting access token from metadata server at: http://169.x.x.x/computeMetadata/v1/instance/service-accounts/default/token

What is strange here is I can write/read the data using the GOOGLE_APPLICATION_CREDENTIALS environment variable. I can also read the data using Databricks Secret. So I don't know why it is not work when I write the data with Databricks Secret.

Here is my code to read the Databricks Secret:

import base64

cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred)
cred = cred.decode('ascii')
spark.conf.set("credentials", cred)

Below is my code to read/write the data:

# Read data
df = spark.read.format("bigquery")
    .option("parentProject", <parent-project-id>)
    .option("viewsEnabled","true")
    .option("table", <table-name>)
    .load()

# Write data
df.write.format("bigquery") \
  .mode("overwrite") \
  .option("temporaryGcsBucket", <bucket-name>) \
  .option("table", <table-name>) \
  .option("parentProject", <parent-project-id>) \
  .save()

Am I missing any configuration for writing the data to BigQuery with Databricks Secret?

VZLA · ‎12-06-2024

Hello, thank you for your question.

To address your issue, the problem likely arises because the BigQuery connector expects a JSON credentials file or an environment variable pointing to it. Using a Databricks Secret directly requires extra steps. Here's how you can resolve it:

Retrieve the JSON credentials from the Databricks Secret as a raw string, avoiding Base64 encoding. Ensure the JSON is passed exactly as it was stored in the secret.
Dynamically create a temporary file in your environment to store the credentials. Use a method to write the JSON credentials to a file and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of this file.
Confirm that your Spark session has the required configurations set for BigQuery, including the temporary GCS bucket and parent project. Ensure the service account in your credentials has permissions for both the BigQuery table and the GCS bucket.
Double-check that your temporary GCS bucket is properly configured and accessible by the service account, as write operations often rely on this intermediate storage.

If these steps don't resolve the issue, feel free to share additional details for further assistance.

huytran · ‎12-09-2024

Hi @VZLA

I tried to get the Databricks Secret, write to a temporary file, and set the temporary file path to GOOGLE_APPLICATION_CREDENTIALS with the code below. And yes, I have to encode the Secret value so I can write it to a temporary file.

cred = dbutils.secrets.get("bigquery-cred", "project-id-name").encode('utf-8')
fd = os.open("/tmp/cred.json", os.O_RDWR|os.O_CREAT)
ret = os.write(fd, cred) 
os.close(fd)

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/tmp/cred.json'

Then I got this error:

Py4JJavaError: An error occurred while calling o677.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

I guess it doesn't get the value from the temporary file. Could you check if I use the correct code to write the file?

VZLA · ‎12-09-2024

Interesting, maybe there's a problem with the encoding, can you please give this a try?

import os

# Retrieve the JSON credentials from Databricks Secrets
cred = dbutils.secrets.get("bigquery-cred", "project-id-name")

# Write the JSON to a temporary file
temp_cred_path = "/tmp/cred.json"
with open(temp_cred_path, "w") as temp_cred_file:
    temp_cred_file.write(cred)

# Set the environment variable to point to the JSON file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = temp_cred_path

# Verify the environment variable and file content (for debugging)
print("GOOGLE_APPLICATION_CREDENTIALS:", os.environ['GOOGLE_APPLICATION_CREDENTIALS'])
with open(temp_cred_path, "r") as temp_cred_file:
    print(temp_cred_file.read())

huytran · ‎12-09-2024

Hi @VZLA

The code returns below. The 2nd print command returns "[REDACTED]" and I think it is because we don't encode it.

GOOGLE_APPLICATION_CREDENTIALS: /tmp/cred.json
[REDACTED]

And I got the same error above:

Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

More information, the way I create the Databricks Secret is I have the file.json value below:

{
        "type": "service_account",
        "project_id": "xxxx-yyy",
        "private_key_id": "xxxxxxx",
        "private_key": "-----BEGIN PRIVATE KEY----------END PRIVATE KEY-----\n",
        "client_email": "xxx@yyyy.iam.gserviceaccount.com",
        "client_id": "xxxxx",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx%40xxxx.iam.gserviceaccount.com"
}

Then I used this command to create the Secret:

databricks secrets put-secret bigquery-cred project-id-name --string-value "$(cat file.json)"

VZLA · ‎12-09-2024

The json and command looks good to me.

Is it feasible for you to attach the full stacktrace, and/or check if this is applicable to your cluster setup as well?

https://community.databricks.com/t5/community-platform-discussions/bigquery-in-notebook-failing-with...

huytran · ‎12-09-2024

@VZLA

I checked my cluster setup and it is a shared cluster. Many people in my company can use it for their notebook.

And here is the full stackstrace:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-846265064717828> in <module>
      1 # this exports to Optus Big Query -- note "overwrite" in method def above
----> 2 write_to_bigquery(table_name, bucket, dataset, project_id)

<command-846265064717827> in write_to_bigquery(tableName, tempBucket, df, project_id)
      1 def write_to_bigquery(tableName, tempBucket, df, project_id):
----> 2   df.write.format("bigquery") \
      3   .option("temporaryGcsBucket", tempBucket) \
      4   .option("parentProject", project_id) \
      5   .option("table", tableName) \

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    736             self.format(format)
    737         if path is None:
--> 738             self._jwrite.save()
    739         else:
    740             self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

1 error
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:110)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:174)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:245)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:393)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:192)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:979)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:147)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:343)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:174)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:170)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:590)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:168)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:590)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:566)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:170)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:170)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:155)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:146)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:200)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.UncheckedIOException: Failed to create default Credentials
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:101)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.<init>(BigQueryCredentialsSupplier.java:50)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:53)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule$$FastClassByGuice$$b1b60333.invoke(<generated>)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1094)
	... 45 more
Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:134)
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:124)
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:96)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:99)
	... 64 more

I found this error:

Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

Does it mean the code cannot read the credential in the temporary file?

VZLA · ‎12-09-2024

It seems that nothing is being loaded into the GOOGLE_APPLICATION_CREDENTIALS.

From https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md

# The JSON keyfile of the service account used for GCS
# access when google.cloud.auth.service.account.enable is true.
spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile

I think its worth trying this before raising a support ticket as there are many details that could be playing a role here. I think os.environ["GOOGLE_APPLICATION_CREDENTIALS"], should've worked to be honest. So, I would expected the following to be a good starting point -summarizing what we've done so far-:

Here is how you can modify your code to ensure it works correctly:

Reading the Databricks Secret and Setting the Spark Configuration:

import base64

# Read the secret from Databricks Secrets
cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred).decode('ascii')

# Set the credentials in Spark configuration
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", cred)

Reading Data from BigQuery:

# Read data from BigQuery
df = spark.read.format("bigquery") \
    .option("parentProject", "<parent-project-id>") \
    .option("viewsEnabled", "true") \
    .option("table", "<table-name>") \
    .load()

Writing Data to BigQuery:

# Write data to BigQuery
df.write.format("bigquery") \
    .mode("overwrite") \
    .option("temporaryGcsBucket", "<bucket-name>") \
    .option("table", "<table-name>") \
    .option("parentProject", "<parent-project-id>") \
    .save()

About the stacktrace itself, it simply looks like indeed the GOOGLE_APPLICATION_CREDENTIALS isn't set or is inaccessible, and none of the other defaultt credential sources are available:

https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth...
- Consider this https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth...
https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth...
- I think it eventually goes through https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth... and attempts the other methods for a couple of second before bailing out.

Databricks Community

Cannot write data to BigQuery when using Databricks secret

Connect with Databricks Users in Your Area

Databricks Learning Festival (Virtual): 15 January - 31 January 2025

Share Your Feedback in Our Community Survey

Databricks Named a Leader in the 2024 Gartner® Magic Quadrant™ for Cloud Database Management Systems

Milestone: DatabricksTV Reaches 100 Videos!

Announcing the new Meta Llama 3.3 model on Databricks