12-01-2024 05:41 AM
I am following this guide on writing data to a BigQuery table.
Right now, I get an error when I try to write data using a Databricks Secret instead of a JSON credential file referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable.
java.io.IOException: Error getting access token from metadata server at: http://169.x.x.x/computeMetadata/v1/instance/service-accounts/default/token
What is strange here is that I can read and write the data using the GOOGLE_APPLICATION_CREDENTIALS environment variable, and I can also read the data using the Databricks Secret. So I don't know why it does not work when I write the data with the Databricks Secret.
Here is my code to read the Databricks Secret:
import base64

cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name").encode('ascii')
cred = base64.b64encode(cred)
cred = cred.decode('ascii')
spark.conf.set("credentials", cred)
Below is my code to read/write the data:
# Read data
df = spark.read.format("bigquery") \
    .option("parentProject", <parent-project-id>) \
    .option("viewsEnabled", "true") \
    .option("table", <table-name>) \
    .load()

# Write data
df.write.format("bigquery") \
    .mode("overwrite") \
    .option("temporaryGcsBucket", <bucket-name>) \
    .option("table", <table-name>) \
    .option("parentProject", <parent-project-id>) \
    .save()
Am I missing any configuration for writing the data to BigQuery with Databricks Secret?
12-06-2024 08:39 AM
Hello, thank you for your question.
To address your issue, the problem likely arises because the BigQuery connector expects a JSON credentials file or an environment variable pointing to it. Using a Databricks Secret directly requires extra steps. Here's how you can resolve it:
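In short, the idea is to read the secret in the notebook, write it to a temporary file on the driver, and point GOOGLE_APPLICATION_CREDENTIALS at that file. A minimal sketch (the scope and key names below are placeholders):

import os

# Read the full JSON key from Databricks Secrets (placeholder scope/key names)
cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name")

# Write it to a temporary file on the driver
cred_path = "/tmp/bigquery-cred.json"
with open(cred_path, "w") as f:
    f.write(cred)

# Point the Google client libraries at the file
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = cred_path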
If these steps don't resolve the issue, feel free to share additional details for further assistance.
12-09-2024 04:41 AM
Hi @VZLA
I tried to get the Databricks Secret, write it to a temporary file, and set GOOGLE_APPLICATION_CREDENTIALS to the temporary file path with the code below. And yes, I have to encode the Secret value so I can write it to the temporary file.
import os

cred = dbutils.secrets.get("bigquery-cred", "project-id-name").encode('utf-8')
# Write the credentials to a temporary file
fd = os.open("/tmp/cred.json", os.O_RDWR | os.O_CREAT)
ret = os.write(fd, cred)
os.close(fd)
# Point the Google libraries at the temporary file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/tmp/cred.json'
Then I got this error:
Py4JJavaError: An error occurred while calling o677.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient
I guess it doesn't get the value from the temporary file. Could you check whether I am using the correct code to write the file?
12-09-2024 05:18 AM
Interesting. Maybe there's a problem with the encoding; can you please give this a try?
import os
# Retrieve the JSON credentials from Databricks Secrets
cred = dbutils.secrets.get("bigquery-cred", "project-id-name")
# Write the JSON to a temporary file
temp_cred_path = "/tmp/cred.json"
with open(temp_cred_path, "w") as temp_cred_file:
temp_cred_file.write(cred)
# Set the environment variable to point to the JSON file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = temp_cred_path
# Verify the environment variable and file content (for debugging)
print("GOOGLE_APPLICATION_CREDENTIALS:", os.environ['GOOGLE_APPLICATION_CREDENTIALS'])
with open(temp_cred_path, "r") as temp_cred_file:
print(temp_cred_file.read())
12-09-2024 05:33 AM
Hi @VZLA
The code returns the output below. The second print command returns "[REDACTED]", which I suspect is the notebook redacting the secret value rather than an encoding problem.
GOOGLE_APPLICATION_CREDENTIALS: /tmp/cred.json
[REDACTED]
And I got the same error as above:
Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient
For more context, this is how I created the Databricks Secret. I have a file.json with the content below:
{
"type": "service_account",
"project_id": "xxxx-yyy",
"private_key_id": "xxxxxxx",
"private_key": "-----BEGIN PRIVATE KEY----------END PRIVATE KEY-----\n",
"client_email": "xxx@yyyy.iam.gserviceaccount.com",
"client_id": "xxxxx",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx%40xxxx.iam.gserviceaccount.com"
}
Then I used this command to create the Secret:
databricks secrets put-secret bigquery-cred project-id-name --string-value "$(cat file.json)"
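As an extra check on my side, I can confirm that the stored secret parses as valid JSON from a notebook without printing its value (a small sketch; it only prints the top-level key names):

import json

# Fetch the stored secret and parse it; json.loads will fail if the JSON is malformed
cred = dbutils.secrets.get("bigquery-cred", "project-id-name")
parsed = json.loads(cred)

# Printing the key names is safe; the secret value itself would be redacted anyway
print(sorted(parsed.keys()))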
12-09-2024 06:34 AM
The JSON and the command look good to me.
Would it be feasible for you to attach the full stack trace, and/or check whether this is applicable to your cluster setup as well?
12-09-2024 06:51 AM - edited 12-09-2024 06:57 AM
I checked my cluster setup and it is a shared cluster; many people in my company can use it for their notebooks.
And here is the full stack trace:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<command-846265064717828> in <module>
1 # this exports to Optus Big Query -- note "overwrite" in method def above
----> 2 write_to_bigquery(table_name, bucket, dataset, project_id)
<command-846265064717827> in write_to_bigquery(tableName, tempBucket, df, project_id)
1 def write_to_bigquery(tableName, tempBucket, df, project_id):
----> 2 df.write.format("bigquery") \
3 .option("temporaryGcsBucket", tempBucket) \
4 .option("parentProject", project_id) \
5 .option("table", tableName) \
/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
736 self.format(format)
737 if path is None:
--> 738 self._jwrite.save()
739 else:
740 self._jwrite.save(path)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
1302
1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
1305 answer, self.gateway_client, self.target_id, self.name)
1306
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
115 def deco(*a, **kw):
116 try:
--> 117 return f(*a, **kw)
118 except py4j.protocol.Py4JJavaError as e:
119 converted = convert_exception(e.java_exception)
/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o679.save.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:
1) Error in custom provider, java.io.UncheckedIOException: Failed to create default Credentials
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:46)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier
for the 3rd parameter of com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryClient(BigQueryClientModule.java:63)
while locating com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClient
1 error
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:110)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:80)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:78)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:89)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.$anonfun$applyOrElse$1(QueryExecution.scala:174)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$8(SQLExecution.scala:245)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:393)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:192)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:979)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:147)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:343)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:174)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$$nestedInanonfun$eagerlyExecuteCommands$1$1.applyOrElse(QueryExecution.scala:170)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:590)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:590)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:268)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:566)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$eagerlyExecuteCommands$1(QueryExecution.scala:170)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:324)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:170)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:155)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:146)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:200)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:959)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:427)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:396)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:258)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:295)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.UncheckedIOException: Failed to create default Credentials
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:101)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.<init>(BigQueryCredentialsSupplier.java:50)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule.provideBigQueryCredentialsSupplier(BigQueryClientModule.java:53)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryClientModule$$FastClassByGuice$$b1b60333.invoke(<generated>)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:42)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:65)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1094)
... 45 more
Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:134)
at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:124)
at com.google.cloud.spark.bigquery.repackaged.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:96)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.connector.common.BigQueryCredentialsSupplier.createDefaultCredentials(BigQueryCredentialsSupplier.java:99)
... 64 more
I found this error:
Caused by: java.io.IOException: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Does it mean the code cannot read the credentials from the temporary file?
12-09-2024 07:54 AM - edited 12-09-2024 08:00 AM
It seems that nothing is being loaded from GOOGLE_APPLICATION_CREDENTIALS.
From https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md
# The JSON keyfile of the service account used for GCS
# access when google.cloud.auth.service.account.enable is true.
spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/keyfile
I think it's worth trying this before raising a support ticket, as there are many details that could be playing a role here. To be honest, I think os.environ["GOOGLE_APPLICATION_CREDENTIALS"] should have worked. So I would expect the following to be a good starting point, summarizing what we've done so far:
Here is how you can modify your code to ensure it works correctly:
Reading the Databricks Secret and Setting the Spark Configuration:
import os

# Read the full JSON key from Databricks Secrets
cred = dbutils.secrets.get(scope="bigquery-scope", key="secret-name")

# Write it to a temporary keyfile and point the connector configuration at that path
keyfile_path = "/tmp/cred.json"
with open(keyfile_path, "w") as f:
    f.write(cred)

spark.conf.set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", keyfile_path)
Reading Data from BigQuery:
# Read data from BigQuery
df = spark.read.format("bigquery") \
.option("parentProject", "<parent-project-id>") \
.option("viewsEnabled", "true") \
.option("table", "<table-name>") \
.load()
Writing Data to BigQuery:
# Write data to BigQuery
df.write.format("bigquery") \
.mode("overwrite") \
.option("temporaryGcsBucket", "<bucket-name>") \
.option("table", "<table-name>") \
.option("parentProject", "<parent-project-id>") \
.save()
About the stack trace itself, it simply looks like GOOGLE_APPLICATION_CREDENTIALS isn't set or is inaccessible, and none of the other default credential sources are available.
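As a quick sanity check, you could also ask the JVM driver (where the connector resolves credentials) whether it sees the variable you set from Python; a small sketch using the internal spark._jvm handle:

# Read the environment variable from the JVM driver process via Py4J
jvm_value = spark._jvm.java.lang.System.getenv("GOOGLE_APPLICATION_CREDENTIALS")
print("JVM sees GOOGLE_APPLICATION_CREDENTIALS:", jvm_value)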