Administration & Architecture

from Google Cloud Storage

refah_1
New Contributor

Hi everyone,

I'm new to Databricks and am trying to connect my Google Cloud Storage bucket to my Databricks workspace. I have a 43GB CSV file stored in a GCP bucket that I want to work with. Here's what I've done so far:

  1. Bucket Setup:

    • I created a GCP bucket (in the west6 region) where my CSV file is stored.
  2. Databricks Configuration:

    • I have a Databricks workspace (in the west2 region).
    • I created a storage credential in Unity Catalog using a GCP Service Account, and I noted down the service account email.
  3. IAM Roles:

    • In the Google Cloud Console, I granted the service account the Storage Legacy Bucket Reader and Storage Object Admin roles on my bucket.
  4. External Location:

    • I attempted to create an external location in Databricks, pointing to gs://<my-bucket-name>/, using the storage credential I created.
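For reference, the equivalent SQL for the external location in step 4 would look roughly like this when run from a notebook; the location and credential names are placeholders for what I actually set up:

      # Placeholders: my_gcs_location and my_gcp_credential; run on UC-enabled compute.
      spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS my_gcs_location
          URL 'gs://<my-bucket-name>/'
          WITH (STORAGE CREDENTIAL my_gcp_credential)
      """)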

Despite following these steps, I'm unable to see or access my CSV file from Databricks. I'm not sure if the region difference (bucket in west6 vs. workspace in west2) or something else is causing the issue.

Has anyone experienced a similar problem or can provide guidance on troubleshooting this connection? Any help would be greatly appreciated!

Thanks in advance!

1 REPLY

Louis_Frolio
Databricks Employee

Hey @refah_1, thanks for laying out the steps; you're very close. Here's a structured checklist to get GCS working with Unity Catalog, plus a couple of common gotchas to check.

 

What's likely going on

  • The region mismatch isn't the root cause; the docs emphasize co-locating the bucket and workspace mainly to avoid egress charges, not as a hard requirement for connectivity.
  • If your GCS bucket has Hierarchical Namespace (HNS) enabled, Unity Catalog external locations won't work. Make sure HNS is disabled on that bucket.
  • You must assign the required GCS IAM roles to the Databricks-generated service account associated with your storage credential. The roles are Storage Legacy Bucket Reader and Storage Object Admin, and they must be granted on the bucket to exactly that service account principal.
  • To see or use the external location in Databricks, you also need Unity Catalog privileges (sketched right after this list):
    • BROWSE to list paths in Catalog Explorer.
    • READ FILES to read files via gs://... paths.
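For completeness, granting those two privileges from a notebook looks something like this; my_gcs_location and the grantee are placeholders for your external location name and your user or group:

      # Placeholders: my_gcs_location and the principal; run on UC-enabled compute.
      spark.sql("GRANT BROWSE ON EXTERNAL LOCATION my_gcs_location TO `your_user_or_group`")
      spark.sql("GRANT READ FILES ON EXTERNAL LOCATION my_gcs_location TO `your_user_or_group`")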

Quick validation steps in Databricks

  1. In Catalog Explorer, open your External location and click Test connection (verifies READ/WRITE/PATH EXIST/DELETE). If this fails, the issue is with the storage credential, IAM on the bucket, or the path.
  2. Confirm your external location URL points to the correct parent path containing the CSV, for example gs://my-bucket/ or gs://my-bucket/path/. Paths must use only ASCII characters (A-Z, a-z, 0-9, /, _, -).
  3. Grant yourself the external location privileges:
    • In Catalog Explorer > External locations > your location > Permissions, grant your user/group:
      • BROWSE (to list) and READ FILES (to read).
  4. Verify from a notebook:

     display(dbutils.fs.ls('gs://<my-bucket>/'))  # or the subfolder that contains the CSV

     spark.read.format("csv") \
         .option("header", "true") \
         .option("inferSchema", "true") \
         .load('gs://<my-bucket>/<path-to>/file.csv') \
         .display()
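If you prefer SQL for step 4, a rough equivalent check, assuming your external location is named my_gcs_location (a placeholder):

     # Shows the URL, storage credential, and owner recorded for the external location.
     display(spark.sql("DESCRIBE EXTERNAL LOCATION my_gcs_location"))
     # Lists objects under the path; requires the privileges granted above.
     display(spark.sql("LIST 'gs://<my-bucket>/'"))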

Double-check the GCP side

  • Ensure you granted the roles to the exact Databricks-generated service account that appears when you created the storage credential (it looks like an email). Assign the following on the bucket where the CSV resides (a scripted sketch of these grants follows this list):
    • Storage Legacy Bucket Reader
    • Storage Object Admin
  • If you want Databricks to configure file events (optional but recommended), add the custom Pub/Sub and storage.buckets.update permissions described in the docs; otherwise skip this for now.
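Purely as a sketch of the GCP side (outside Databricks), the same two role grants can be scripted with the google-cloud-storage client; the bucket name and service account email below are placeholders:

    # Sketch using the google-cloud-storage Python client; all names are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-bucket-name")
    policy = bucket.get_iam_policy(requested_policy_version=3)

    # The Databricks-generated service account email shown on the storage credential.
    member = "serviceAccount:<databricks-generated-sa-email>"
    for role in ("roles/storage.legacyBucketReader", "roles/storage.objectAdmin"):
        policy.bindings.append({"role": role, "members": {member}})

    bucket.set_iam_policy(policy)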

About regions

  • Co-locating the workspace and bucket is recommended to avoid egress charges and reduce latency, but region differences alone aren't called out as a connection blocker in the docs.

If you still canโ€™t see the file

  • Confirm HNS is disabled on the bucket (not supported with UC external locations).
  • Recreate the storage credential via Catalog Explorer so Databricks generates the service account, then reโ€‘grant the two roles on the bucket to that service account; retest the external location.
  • Make sure you're accessing with Unity Catalog-enabled compute and using three-level namespaces elsewhere (catalog.schema.table) to avoid defaulting to the legacy Hive metastore when creating tables over the data.

Next step ideas for your 43GB CSV

  • Read the CSV once and write it to Delta for faster, reliable downstream reads:

    df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load('gs://<my-bucket>/<path-to>/file.csv'))
    df.write.format("delta").mode("overwrite").save('gs://<my-bucket>/<path-to>/delta')

    Then use an external location to govern that Delta path, or register an external table over it (a sketch of the table registration follows).
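If you then want to query the Delta output by name, a minimal sketch of registering an external table over it; the catalog and schema names are placeholders, and you need the CREATE EXTERNAL TABLE privilege on the external location:

    # Placeholders: main.default.big_csv_delta and the gs:// path.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS main.default.big_csv_delta
      USING DELTA
      LOCATION 'gs://<my-bucket>/<path-to>/delta'
    """)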
 
Hope this helps, Louis.