Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cannot write data to BigQuery when using Databricks secret

huytran
New Contributor III

I am following this guide on writing data to a BigQuery table.

Right now, I get an error when I try to write data using a Databricks secret instead of using the JSON credential file and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.

 

java.io.IOException: Error getting access token from metadata server at: http://169.x.x.x/computeMetadata/v1/instance/service-accounts/default/token

 

What is strange is that I can write/read the data using the GOOGLE_APPLICATION_CREDENTIALS environment variable, and I can also read the data using the Databricks secret. So I don't know why it doesn't work when I write data with the Databricks secret.

Here is my code to read the Databricks Secret:

import base64

# Read the service-account JSON from the secret scope and Base64-encode it;
# the connector's "credentials" property expects Base64-encoded JSON
cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred)
cred = cred.decode('ascii')
spark.conf.set("credentials", cred)
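
For reference, the same Base64 value can also be passed per operation through the connector's credentials option instead of the session conf; a minimal sketch, reusing the scope and key names above (the project and table values are placeholders):

import base64

# Base64-encode the service-account JSON for the connector's "credentials" option
cred = base64.b64encode(
    dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode("utf-8")
).decode("ascii")

df = spark.read.format("bigquery") \
    .option("credentials", cred) \
    .option("parentProject", "<parent-project-id>") \
    .option("table", "<table-name>") \
    .load()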

Below is my code to read/write the data:

# Read data
df = spark.read.format("bigquery") \
    .option("parentProject", <parent-project-id>) \
    .option("viewsEnabled", "true") \
    .option("table", <table-name>) \
    .load()

# Write data
df.write.format("bigquery") \
  .mode("overwrite") \
  .option("temporaryGcsBucket", <bucket-name>) \
  .option("table", <table-name>) \
  .option("parentProject", <parent-project-id>) \
  .save()

Am I missing any configuration for writing the data to BigQuery with Databricks Secret?

 
7 REPLIES

VZLA
Databricks Employee

Hello, thank you for your question. 

The problem likely arises because the BigQuery connector expects a JSON credentials file, or an environment variable pointing to one. Using a Databricks secret directly requires a few extra steps. Here's how you can resolve it:

  • Retrieve the JSON credentials from the Databricks Secret as a raw string, avoiding Base64 encoding. Ensure the JSON is passed exactly as it was stored in the secret.
  • Write the JSON credentials to a temporary file and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that file (a minimal sketch follows this list).
  • Confirm that your Spark session has the required configurations set for BigQuery, including the temporary GCS bucket and parent project. Ensure the service account in your credentials has permissions for both the BigQuery table and the GCS bucket.
  • Double-check that your temporary GCS bucket is properly configured and accessible by the service account, as write operations often rely on this intermediate storage.
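
A minimal sketch of the first two steps, assuming a scope named bigquery-scope and a key named secret-name (substitute your own):

import os

# Fetch the raw service-account JSON from the secret scope (no Base64 encoding)
cred_json = dbutils.secrets.get(scope="bigquery-scope", key="secret-name")

# Write it to a driver-local temporary file
cred_path = "/tmp/bigquery-credentials.json"
with open(cred_path, "w") as f:
    f.write(cred_json)

# Point the Google client libraries at the keyfile
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_path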

If these steps don't resolve the issue, feel free to share additional details for further assistance.

 

huytran
New Contributor III

Hi @VZLA 

I tried to get the Databricks secret, write it to a temporary file, and set GOOGLE_APPLICATION_CREDENTIALS to that file path with the code below. And yes, I had to encode the secret value so I could write it to the temporary file with os.write.

cred = dbutils.secrets.get("bigquery-cred", "project-id-name").encode('utf-8')
fd = os.open("/tmp/cred.json", os.O_RDWR|os.O_CREAT)
ret = os.write(fd, cred) 
os.close(fd)

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/tmp/cred.json'

 Then I got this error:

Py4JJavaError: An error occurred while calling o677.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

I guess it isn't picking up the value from the temporary file. Could you check whether my code for writing the file is correct?

VZLA
Databricks Employee

Interesting. Maybe there's a problem with the encoding; can you please give this a try?

import os

# Retrieve the JSON credentials from Databricks Secrets
cred = dbutils.secrets.get("bigquery-cred", "project-id-name")

# Write the JSON to a temporary file
temp_cred_path = "/tmp/cred.json"
with open(temp_cred_path, "w") as temp_cred_file:
    temp_cred_file.write(cred)

# Set the environment variable to point to the JSON file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = temp_cred_path

# Verify the environment variable and file content (for debugging)
print("GOOGLE_APPLICATION_CREDENTIALS:", os.environ['GOOGLE_APPLICATION_CREDENTIALS'])
with open(temp_cred_path, "r") as temp_cred_file:
    print(temp_cred_file.read())
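
As a side note, Databricks redacts secret values in notebook output, so the final print above may show [REDACTED] even when the file was written correctly. A sketch of a check that avoids printing the secret itself (assuming the same /tmp/cred.json path):

import json
import os

cred_path = "/tmp/cred.json"

# File size and JSON structure can be inspected without echoing the secret value
print("file size (bytes):", os.path.getsize(cred_path))
with open(cred_path) as f:
    parsed = json.load(f)  # raises an error if the file is not valid JSON
print("keys:", sorted(parsed.keys()))
print("is service account:", parsed.get("type") == "service_account")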

 

huytran
New Contributor III

Hi @VZLA 

The code returns the output below. The second print command returns "[REDACTED]", and I think it is because we don't encode it.

GOOGLE_APPLICATION_CREDENTIALS: /tmp/cred.json
[REDACTED]

And I got the same error as above:

Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

Some more information: the way I created the Databricks secret is that I have a file.json with the contents below:

{
        "type": "service_account",
        "project_id": "xxxx-yyy",
        "private_key_id": "xxxxxxx",
        "private_key": "-----BEGIN PRIVATE KEY----------END PRIVATE KEY-----\n",
        "client_email": "xxx@yyyy.iam.gserviceaccount.com",
        "client_id": "xxxxx",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx%40xxxx.iam.gserviceaccount.com"
}

Then I used this command to create the Secret:

databricks secrets put-secret bigquery-cred project-id-name --string-value "$(cat file.json)"

 

VZLA
Databricks Employee

The JSON and command look good to me.

Is it feasible for you to attach the full stack trace, and/or check whether this is applicable to your cluster setup as well?

https://community.databricks.com/t5/community-platform-discussions/bigquery-in-notebook-failing-with...

huytran
New Contributor III

@VZLA 

I checked my cluster setup and it is a shared cluster. Many people in my company can use it for their notebooks.

And here is the full stack trace:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<command-846265064717828> in <module>
      1 # this exports to Optus Big Query -- note "overwrite" in method def above
----> 2 write_to_bigquery(table_name, bucket, dataset, project_id)

<command-846265064717827> in write_to_bigquery(tableName, tempBucket, df, project_id)
      1 def write_to_bigquery(tableName, tempBucket, df, project_id):
----> 2   df.write.format("bigquery") \
      3   .option("temporaryGcsBucket", tempBucket) \
      4   .option("parentProject", project_id) \
      5   .option("table", tableName) \

/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    736             self.format(format)
    737         if path is None:
--> 738             self._jwrite.save()
    739         else:
    740             self._jwrite.save(path)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
    for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
  while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

1 error
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
	at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:110)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:174)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:245)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:393)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:192)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:979)
	at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:147)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:343)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:174)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:170)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:590)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:168)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:590)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:566)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:170)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:170)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:155)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:146)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:200)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.UncheckedIOException: Failed to create default Credentials
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:101)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.<init>(BigQueryCredentialsSupplier.java:50)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:53)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule$$FastClassByGuice$$b1b60333.invoke(<generated>)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
	at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1094)
	... 45 more
Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:134)
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:124)
	at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:96)
	at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:99)
	... 64 more

I found this error:

Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

Does it mean the code cannot read the credential in the temporary file?

VZLA
Databricks Employee

It seems that nothing is being picked up from GOOGLE_APPLICATION_CREDENTIALS.

From https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md

 

# The JSON keyfile of the service account used for GCS
# access when google.cloud.auth.service.account.enable is true.
spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile

 

I think it's worth trying this before raising a support ticket, as there are many details that could be playing a role here. To be honest, I think os.environ["GOOGLE_APPLICATION_CREDENTIALS"] should have worked. So I would expect the following to be a good starting point, summarizing what we have done so far (the scope, key, bucket, and project names are the placeholders used earlier in this thread):
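
import base64
import os

# 1) Pull the raw service-account JSON from the secret and write it to a driver-local file
cred_json = dbutils.secrets.get("bigquery-cred", "project-id-name")
cred_path = "/tmp/cred.json"
with open(cred_path, "w") as f:
    f.write(cred_json)

# 2) Point Google's Application Default Credentials lookup at the file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_path

# 3) Make the keyfile visible to the GCS connector that handles the temporary bucket
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("google.cloud.auth.service.account.enable", "true")
hadoop_conf.set("google.cloud.auth.service.account.json.keyfile", cred_path)

# 4) Hand the credentials to the BigQuery connector directly as Base64-encoded JSON
cred_b64 = base64.b64encode(cred_json.encode("utf-8")).decode("ascii")

# 5) Write as before, passing the credentials explicitly on the writer
df.write.format("bigquery") \
    .mode("overwrite") \
    .option("credentials", cred_b64) \
    .option("parentProject", "<parent-project-id>") \
    .option("temporaryGcsBucket", "<bucket-name>") \
    .option("table", "<table-name>") \
    .save()

If this still fails on the shared cluster, it may also be worth putting the keyfile properties in the cluster's Spark config instead (spark.hadoop.google.cloud.auth.service.account.json.keyfile=/tmp/cred.json and spark.hadoop.google.cloud.auth.service.account.enable=true), since Hadoop configuration changes made at runtime are not always picked up in every context.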
