12-01-2022 01:42 AM
Hi,
I would like to connect our BigQuery environment to Databricks, so I created a service account, but where should I configure the service account in Databricks? I read the Databricks documentation and it's not clear at all.
Thanks for your help
12-01-2022 03:17 AM
https://docs.databricks.com/external-data/bigquery.html
Can you elaborate on what is not clear?
12-01-2022 03:21 AM
Yeah, part number 2 (setting up Databricks) has the below code:
credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
What should I put in place of <base64-keys>? The Google service account key (JSON)? If yes, which part of it?
12-01-2022 03:24 AM
The base64-keys value is generated from the JSON key file:
To configure a cluster to access BigQuery tables, you must provide your JSON key file as a Spark configuration. Use a local tool to Base64-encode your JSON key file. For security purposes do not use a web-based or remote tool that could access your keys.
The JSON key file is created right above the following section:
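For example, here is a minimal Python sketch of that local encoding step (assuming the downloaded key file is at key.json; the path is hypothetical):
import base64

# Base64-encode the service account key file locally;
# never paste the raw key into a web-based tool.
with open("key.json", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

print(encoded)  # paste this value as the `credentials` Spark config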
12-01-2022 03:35 AM
So basically it should look like this:
credentials <adfasdfsadfadsfsdafsd>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <user@service.com>
spark.hadoop.fs.gs.project.id <project-dd>
spark.hadoop.fs.gs.auth.service.account.private.key <fdsfsdfsdgfd>
spark.hadoop.fs.gs.auth.service.account.private.key.id <gsdfgsdgdsg>
Do I need to add double quotes ("")?
12-01-2022 03:46 AM
Without the pointy brackets; they are placeholders for values.
So unless you want to enter a variable which you already declared (like credentials in your example), put the values in double quotes.
12-01-2022 04:17 AM
Thanks werners.
It's working now. When I run the below script:
df = spark.read.format("bigquery").option("table","sandbox.test").load()
I'm getting the below error:
12-01-2022 04:25 AM
Are you sure the path to the table is correct?
The example is a bit different:
"bigquery-public-data.samples.shakespeare"
<catalog>.<db>.<table>
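In other words, something like this (a sketch using the names from your post, assuming test_proj is your project ID):
df = spark.read.format("bigquery") \
    .option("table", "test_proj.sandbox.test") \
    .load()
df.show()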
โ12-01-2022 04:33 AM
I also changed the path to "test_proj.sandbox.test".
the error is :
A project ID is required for this service but could not be determined from the builder or the environment. Please set a project ID using the builder.
12-01-2022 04:38 AM
I guess something still has to be configured on BigQuery.
Can you check this thread?
https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/40
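Per that thread, the connector may also need the billing ("parent") project set explicitly; a minimal sketch, assuming test_proj is also your billing project:
df = spark.read.format("bigquery") \
    .option("table", "test_proj.sandbox.test") \
    .option("parentProject", "test_proj") \
    .load()  # parentProject is the GCP project billed for the read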
12-01-2022 04:43 AM
Works 🙂
Thanks werners, many thanks.
02-02-2023 07:53 PM
Thank you. For me, setting the parent project ID solved it. This is also in the documentation:
df = spark.read.format("bigquery") \
    .option("table", table) \
    .option("project", "<project-id>") \
    .option("parentProject", "<parent-project-id>") \
    .load()
I didn't have to set the various spark.hadoop.fs.gs config variables for the cluster, as it seemed content with the base64 credentials.
12-01-2022 03:23 AM
I'm familiar with this doc; it is not clear (please see my previous comment).