Exporting table to GCS bucket using job

aswinvishnu
New Contributor II

Hi all,

Use case: I want to write the result of a query to a GCS bucket location in JSON format.

Approach: From my Java-based application I create a job, and that job runs a notebook. The notebook contains something like this:

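```
# Run the query and write the result to GCS as JSON files
query = "SELECT * FROM table"
df = spark.sql(query)
gcs_path = "gs://<bucket>/path/"
# Cap each output file at 100 records and replace any existing output
df.write.option("maxRecordsPerFile", 100).mode("overwrite").json(gcs_path)
```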
I am able to provide access to my GCS bucket using a service account JSON key that has access to my GCS account. For my use case, however, I can't provide the service account information to the Databricks account. I am okay with exposing an access token created from the service account instead.
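
For context, creating such a short-lived token from a service account key could look roughly like the sketch below. It is written in Python with the `google-auth` library purely for illustration (the real application is Java), and the function name `mint_access_token`, the `DEFAULT_SCOPES` constant, and the key file path are hypothetical.

```python
# Assumed scope: read/write access to Cloud Storage only.
DEFAULT_SCOPES = ("https://www.googleapis.com/auth/devstorage.read_write",)


def mint_access_token(key_file, scopes=DEFAULT_SCOPES):
    """Return (token, expiry) for a short-lived OAuth2 access token.

    key_file is the path to a service account JSON key. It stays on the
    machine that mints the token and is never sent to Databricks.
    """
    # Imported lazily so this module loads even where google-auth is absent.
    from google.oauth2 import service_account
    import google.auth.transport.requests

    credentials = service_account.Credentials.from_service_account_file(
        key_file, scopes=list(scopes)
    )
    # refresh() exchanges the key for a bearer token.
    credentials.refresh(google.auth.transport.requests.Request())
    return credentials.token, credentials.expiry
```

Only the returned token would then be passed to the job, for example as a notebook parameter. The token expires, so the job has to finish (or receive a fresh token) before `expiry`.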

I tried something like:
```
spark.conf.set("spark.hadoop.fs.gs.auth.type", "OAuth")
spark.conf.set("spark.hadoop.fs.gs.auth.access.token", access_token)
```

which didn't have any effect. I am getting the error below in my notebook:
Py4JJavaError: An error occurred while calling o476.json. : java.io.IOException: Error getting access token from metadata server at:

I'm kind of stuck on this. Any help would be appreciated.
Thanks,
Aswin

lingareddy_Alva
Esteemed Contributor

Hi @aswinvishnu 

GCS support in Spark via the Hadoop connectors has specific limitations, and using a raw access token (OAuth token) instead of a service account key file is tricky, especially in Databricks.

You're trying to use access-token-based authentication, but the GCS Hadoop connector (used under the hood by Spark) typically expects one of:
  1. A service account key file (standard)
  2. ADC (Application Default Credentials) from the environment/metadata server (in GCP-native services like GKE or Dataproc)


Databricks is not natively GCP, so it doesn't have access to the GCP metadata server, hence the error:
Error getting access token from metadata server

If you insist on using an access token instead of a key file, change your auth type to ACCESS_TOKEN (not "OAuth"):

spark.conf.set("spark.hadoop.fs.gs.auth.type", "ACCESS_TOKEN")
spark.conf.set("spark.hadoop.fs.gs.auth.access.token", access_token)

This is the correct config to pass a bearer token manually (OAuth is for interactive user flows; ACCESS_TOKEN is for static token use like this).
However, this still may not work reliably in Spark unless you're using the right version of the GCS connector (>= 2.2.0). Databricks may bundle older or customized versions.
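
If the connector does not pick up the token, one possible fallback is to bypass the Hadoop connector and upload the result through the GCS JSON API, which accepts a bearer token in the `Authorization` header. The following is a minimal sketch using only the Python standard library, not a tested Databricks recipe. The helper names `build_upload_request` and `upload`, the bucket, and the object name are hypothetical, and it assumes the query result is small enough to collect on the driver.

```python
import json
import urllib.parse
import urllib.request

# Simple ("media") upload endpoint of the GCS JSON API.
GCS_UPLOAD_ENDPOINT = "https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o"


def build_upload_request(bucket, object_name, rows, access_token):
    """Build (but do not send) an upload request for newline-delimited JSON."""
    # One JSON document per line; default=str covers dates and decimals.
    body = "\n".join(json.dumps(row, default=str) for row in rows).encode("utf-8")
    query = urllib.parse.urlencode({"uploadType": "media", "name": object_name})
    url = GCS_UPLOAD_ENDPOINT.format(bucket=urllib.parse.quote(bucket, safe=""))
    return urllib.request.Request(
        url + "?" + query,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def upload(request):
    """Send the request and return the object metadata that GCS responds with."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In the notebook, `rows` could come from something like `[r.asDict() for r in df.toLocalIterator()]`, followed by `upload(build_upload_request("<bucket>", "path/result.json", rows, access_token))`. This gives up Spark's parallel write, so it only suits modest result sets.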

 

LR

Hi @lingareddy_Alva,
Thanks for the reply. I tried the 'ACCESS_TOKEN' auth type too, but it didn't make any difference.

LorelaiSpence
New Contributor II

Consider using GCS signed URLs or access tokens for secure access.
