Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Exporting table to GCS bucket using job

aswinvishnu
New Contributor II

Hi all,

Use case: I want to send the result of a query to a GCS bucket location in JSON format.

Approach: From my Java-based application I create a job, and that job runs a notebook. The notebook will have something like this:

```
# Run the query and write the result to GCS as JSON files
query = "SELECT * FROM table"
df = spark.sql(query)
gcs_path = "gs://<bucket>/path/"
df.write.option("maxRecordsPerFile", 100).mode("overwrite").json(gcs_path)
```
I am able to provide access to my GCS bucket using a service account JSON that has access to my GCS account. But for my use case, I can't provide the service account information to the Databricks account. I am, however, okay with exposing an access token that will be created from the service account.
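(For context, the token would be minted outside Databricks from the service account key, roughly along these lines. The actual application is Java-based; this is just a Python sketch, and the library usage and key file path are illustrative, not what I currently run.)

```
# Sketch: mint a short-lived OAuth2 access token from the service account key,
# outside Databricks, so only the token is handed to the job.
from google.oauth2 import service_account
from google.auth.transport.requests import Request

credentials = service_account.Credentials.from_service_account_file(
    "sa-key.json",  # illustrative path to the service account key
    scopes=["https://www.googleapis.com/auth/devstorage.read_write"],
)
credentials.refresh(Request())    # exchanges the key for a bearer token
access_token = credentials.token  # short-lived token to pass to the notebook
```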

I tried something like
```
spark.conf.set("spark.hadoop.fs.gs.auth.type", "OAuth")
spark.conf.set("spark.hadoop.fs.gs.auth.access.token", access_token)
```

which didn't have any effect. I am getting the below error in my notebook:
Py4JJavaError: An error occurred while calling o476.json. : java.io.IOException: Error getting access token from metadata server at:

Kind of stuck in this. Any help would be appreciated.
Thanks,
Aswin
3 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @aswinvishnu 

  • GCS support in Spark via the Hadoop connector has specific limitations, and using a raw OAuth access token instead of a service account key file is tricky, especially in Databricks.
    You're trying to use access-token-based authentication, but the GCS Hadoop connector (used under the hood by Spark) typically expects either:
            1. A service account key file (the standard approach), or
            2. ADC (Application Default Credentials) from the environment/metadata server (in GCP-native services like GKE or Dataproc)

Databricks is not natively GCP, so it doesn't have access to the GCP metadata server, hence the error:
Error getting access token from metadata server.
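For reference, option 1 (a key file) is usually wired up through Spark configuration. A rough sketch of how that typically looks with the GCS connector (the placeholder values are mine, not from this thread; in practice these usually go in the cluster's Spark config or come from a secret scope rather than being hard-coded):

```
# Sketch: standard service-account configuration for the GCS connector.
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
spark.conf.set("spark.hadoop.fs.gs.project.id", "<gcp-project-id>")
spark.conf.set("spark.hadoop.fs.gs.auth.service.account.email", "<client-email>")
spark.conf.set("spark.hadoop.fs.gs.auth.service.account.private.key.id", "<private-key-id>")
spark.conf.set("spark.hadoop.fs.gs.auth.service.account.private.key", "<private-key>")
```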

Use spark.hadoop.fs.gs.auth.type=ACCESS_TOKEN (Not "OAuth")
If you insist on using an access token instead of a key file, change your auth type:

spark.conf.set("spark.hadoop.fs.gs.auth.type", "ACCESS_TOKEN")
spark.conf.set("spark.hadoop.fs.gs.auth.access.token", access_token)

This is the correct config to pass a bearer token manually (OAuth is for interactive user flows; ACCESS_TOKEN is for static token use like this).
However, this still may not work reliably in Spark unless you're using the right version of the GCS connector (>= 2.2.0). Databricks may bundle older or customized versions.
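If you do go the token route, one way to wire it together is to have your application pass the token as a notebook parameter and apply it before the write. A sketch (the parameter name gcs_access_token is illustrative, and the connector-version caveat above still applies):

```
# Sketch: read a short-lived token passed as a job/notebook parameter
# and apply it to the GCS connector before writing.
access_token = dbutils.widgets.get("gcs_access_token")  # illustrative parameter name

spark.conf.set("spark.hadoop.fs.gs.auth.type", "ACCESS_TOKEN")
spark.conf.set("spark.hadoop.fs.gs.auth.access.token", access_token)

df = spark.sql("SELECT * FROM table")
df.write.option("maxRecordsPerFile", 100).mode("overwrite").json("gs://<bucket>/path/")
```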

 

LR

Hi @lingareddy_Alva,
Thanks for the reply. I tried the 'ACCESS_TOKEN' auth type too, but it didn't make any difference.

LorelaiSpence
New Contributor II

Consider using GCS signed URLs or access tokens for secure access.

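For the signed-URL idea, a rough sketch of how it could work. The signing happens in the application that holds the service account key, not in Databricks; the google-cloud-storage client, the key file path, and the object name are assumptions for illustration:

```
# Sketch (runs in the application that holds the key, NOT in Databricks):
# pre-sign an upload URL so the notebook can PUT the JSON output without credentials.
import datetime
from google.cloud import storage

client = storage.Client.from_service_account_json("sa-key.json")  # illustrative path
blob = client.bucket("<bucket>").blob("path/result.json")
signed_url = blob.generate_signed_url(
    version="v4",
    expiration=datetime.timedelta(hours=1),
    method="PUT",
    content_type="application/json",
)

# In the notebook, the result could then be uploaded with a plain HTTP PUT, e.g.:
# import requests
# requests.put(signed_url, data=df.toPandas().to_json(orient="records"),
#              headers={"Content-Type": "application/json"})
```

Note that this replaces the distributed df.write with a single upload from the driver, so it really only fits smaller result sets.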
