topic Re: Cannot write data to BigQuery when using Databricks secret in Data Engineering

Cannot write data to BigQuery when using Databricks secret

huytran — Sun, 01 Dec 2024 13:41:32 GMT

I am following this guide on writing data to the BigQuery table.

Right now, I have an error when I try to write data using Databricks Secret instead of the JSON credential file and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable.

java.io.IOException: Error getting access token from metadata server at: http://169.x.x.x/computeMetadata/v1/instance/service-accounts/default/token

What is strange here is I can write/read the data using the GOOGLE_APPLICATION_CREDENTIALS environment variable. I can also read the data using Databricks Secret. So I don't know why it is not work when I write the data with Databricks Secret.

Here is my code to read the Databricks Secret:

import base64

cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred)
cred = cred.decode('ascii')
spark.conf.set("credentials", cred)

Below is my code to read/write the data:

# Read data
df = spark.read.format("bigquery")
    .option("parentProject", <parent-project-id>)
    .option("viewsEnabled","true")
    .option("table", <table-name>)
    .load()

# Write data
df.write.format("bigquery") \
  .mode("overwrite") \
  .option("temporaryGcsBucket", <bucket-name>) \
  .option("table", <table-name>) \
  .option("parentProject", <parent-project-id>) \
  .save()

Am I missing any configuration for writing the data to BigQuery with Databricks Secret?

Re: Cannot write data to BigQuery when using Databricks secret

VZLA — Fri, 06 Dec 2024 16:39:00 GMT

Hello, thank you for your question.

To address your issue, the problem likely arises because the BigQuery connector expects a JSON credentials file or an environment variable pointing to it. Using a Databricks Secret directly requires extra steps. Here's how you can resolve it:

Retrieve the JSON credentials from the Databricks Secret as a raw string, avoiding Base64 encoding. Ensure the JSON is passed exactly as it was stored in the secret.
Dynamically create a temporary file in your environment to store the credentials. Use a method to write the JSON credentials to a file and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of this file.
Confirm that your Spark session has the required configurations set for BigQuery, including the temporary GCS bucket and parent project. Ensure the service account in your credentials has permissions for both the BigQuery table and the GCS bucket.
Double-check that your temporary GCS bucket is properly configured and accessible by the service account, as write operations often rely on this intermediate storage.

If these steps don't resolve the issue, feel free to share additional details for further assistance.

Re: Cannot write data to BigQuery when using Databricks secret

huytran — Mon, 09 Dec 2024 12:41:04 GMT

Hi @VZLA

I tried to get the Databricks Secret, write to a temporary file, and set the temporary file path to GOOGLE_APPLICATION_CREDENTIALS with the code below. And yes, I have to encode the Secret value so I can write it to a temporary file.

cred = dbutils.secrets.get("bigquery-cred", "project-id-name").encode('utf-8') fd = os.open("/tmp/cred.json", os.O_RDWR|os.O_CREAT) ret = os.write(fd, cred) os.close(fd) os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/tmp/cred.json'

Then I got this error:

Py4JJavaError: An error occurred while calling o677.save. : com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors: 1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

I guess it doesn't get the value from the temporary file. Could you check if I use the correct code to write the file?

Re: Cannot write data to BigQuery when using Databricks secret

VZLA — Mon, 09 Dec 2024 13:18:34 GMT

Interesting, maybe there's a problem with the encoding, can you please give this a try?

import os # Retrieve the JSON credentials from Databricks Secrets cred = dbutils.secrets.get("bigquery-cred", "project-id-name") # Write the JSON to a temporary file temp_cred_path = "/tmp/cred.json" with open(temp_cred_path, "w") as temp_cred_file: temp_cred_file.write(cred) # Set the environment variable to point to the JSON file os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = temp_cred_path # Verify the environment variable and file content (for debugging) print("GOOGLE_APPLICATION_CREDENTIALS:", os.environ['GOOGLE_APPLICATION_CREDENTIALS']) with open(temp_cred_path, "r") as temp_cred_file: print(temp_cred_file.read())

Re: Cannot write data to BigQuery when using Databricks secret

huytran — Mon, 09 Dec 2024 13:33:33 GMT

Hi @VZLA

The code returns below. The 2nd print command returns "[REDACTED]" and I think it is because we don't encode it.

GOOGLE_APPLICATION_CREDENTIALS: /tmp/cred.json [REDACTED]

And I got the same error above:

Py4JJavaError: An error occurred while calling o679.save. : com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors: 1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient

More information, the way I create the Databricks Secret is I have the file.json value below:

{ "type": "service_account", "project_id": "xxxx-yyy", "private_key_id": "xxxxxxx", "private_key": "-----BEGIN PRIVATE KEY----------END PRIVATE KEY-----\n", "client_email": "xxx@yyyy.iam.gserviceaccount.com", "client_id": "xxxxx", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx%40xxxx.iam.gserviceaccount.com" }

Then I used this command to create the Secret:

databricks secrets put-secret bigquery-cred project-id-name --string-value "$(cat file.json)"

Re: Cannot write data to BigQuery when using Databricks secret

VZLA — Mon, 09 Dec 2024 14:34:21 GMT

The json and command looks good to me.

Is it feasible for you to attach the full stacktrace, and/or check if this is applicable to your cluster setup as well?

https://community.databricks.com/t5/community-platform-discussions/bigquery-in-notebook-failing-with-unity-catalog-enabled-cluster/td-p/59784

Re: Cannot write data to BigQuery when using Databricks secret

huytran — Mon, 09 Dec 2024 14:57:25 GMT

@VZLA

I checked my cluster setup and it is a shared cluster. Many people in my company can use it for their notebook.

And here is the full stackstrace:

--------------------------------------------------------------------------- Py4JJavaError Traceback (most recent call last) <command-846265064717828> in <module> 1 # this exports to Optus Big Query -- note "overwrite" in method def above ----> 2 write_to_bigquery(table_name, bucket, dataset, project_id) <command-846265064717827> in write_to_bigquery(tableName, tempBucket, df, project_id) 1 def write_to_bigquery(tableName, tempBucket, df, project_id): ----> 2 df.write.format("bigquery") \ 3 .option("temporaryGcsBucket", tempBucket) \ 4 .option("parentProject", project_id) \ 5 .option("table", tableName) \ /databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options) 736 self.format(format) 737 if path is None: --> 738 self._jwrite.save() 739 else: 740 self._jwrite.save(path) /databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 1302 1303 answer = self.gateway_client.send_command(command) -> 1304 return_value = get_return_value( 1305 answer, self.gateway_client, self.target_id, self.name) 1306 /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 115 def deco(*a, **kw): 116 try: --> 117 return f(*a, **kw) 118 except py4j.protocol.Py4JJavaError as e: 119 converted = convert_exception(e.java_exception) /databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value) Py4JJavaError: An error occurred while calling o679.save. : com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors: 1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63) while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient 1 error at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131) at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:110) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89) at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:174) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:245) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:393) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:192) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:979) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:147) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:343) at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:174) at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:170) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:590) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:590) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:566) at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:170) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:170) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:155) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:146) at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:200) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:295) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:251) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.UncheckedIOException: Failed to create default Credentials at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:101) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.<init>(BigQueryCredentialsSupplier.java:50) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:53) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule$$FastClassByGuice$$b1b60333.invoke(<generated>) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39) at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1094) ... 45 more Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information. at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:134) at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:124) at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:96) at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:99) ... 64 more

I found this error:

Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

Does it mean the code cannot read the credential in the temporary file?

Re: Cannot write data to BigQuery when using Databricks secret

VZLA — Mon, 09 Dec 2024 16:00:09 GMT

It seems that nothing is being loaded into the GOOGLE_APPLICATION_CREDENTIALS.

From https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md

# The JSON keyfile of the service account used for GCS # access when google.cloud.auth.service.account.enable is true. spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile

I think its worth trying this before raising a support ticket as there are many details that could be playing a role here. I think os.environ["GOOGLE_APPLICATION_CREDENTIALS"], should've worked to be honest. So, I would expected the following to be a good starting point -summarizing what we've done so far-:

Here is how you can modify your code to ensure it works correctly:

Reading the Databricks Secret and Setting the Spark Configuration:

import base64

# Read the secret from Databricks Secrets
cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred).decode('ascii')

# Set the credentials in Spark configuration
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", cred)

Reading Data from BigQuery:

# Read data from BigQuery
df = spark.read.format("bigquery") \
    .option("parentProject", "<parent-project-id>") \
    .option("viewsEnabled", "true") \
    .option("table", "<table-name>") \
    .load()

Writing Data to BigQuery:

# Write data to BigQuery
df.write.format("bigquery") \
    .mode("overwrite") \
    .option("temporaryGcsBucket", "<bucket-name>") \
    .option("table", "<table-name>") \
    .option("parentProject", "<parent-project-id>") \
    .save()

About the stacktrace itself, it simply looks like indeed the GOOGLE_APPLICATION_CREDENTIALS isn't set or is inaccessible, and none of the other defaultt credential sources are available:

https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth/oauth2/GoogleCredentials.java#L124
- Consider this https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth/oauth2/GoogleCredentials.java#L107-L108
https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth/oauth2/DefaultCredentialsProvider.java#L133
- I think it eventually goes through https://github.com/googleapis/google-auth-library-java/blob/v0.27.0/oauth2_http/java/com/google/auth/oauth2/DefaultCredentialsProvider.java#L159 and attempts the other methods for a couple of second before bailing out.