Sunday - last edited Sunday
I want to use the google-cloud-bigquery library in my PySpark code, even though I know the spark-bigquery-connector is available. The reason is that Databricks Runtime 15.4 LTS ships with the 0.22.2-SNAPSHOT version of the spark-bigquery-connector, which is not a recommended working version as per the documentation. They suggest using version 0.41.0 of the spark-bigquery-connector jar in the Databricks cluster to leverage all the features.
Since I am unable to use the 0.41.0 jar on the Databricks 15.4 LTS cluster, I want to use the google-cloud-bigquery library.
The problem I am facing with the code below is:
from google.cloud import bigquery
gcp_project_id = "gcp-test"
drop_partition_query = f""" DELETE FROM `gcp-test.test-dataset.test-table` WHERE partition_date = '2022-01-22' """
print(drop_partition_query)
client = bigquery.Client(project=f"{gcp_project_id}")
result = client.query(drop_partition_query)
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: [Errno 111] Connection refused
WARNING:google.auth._default:No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b322f6d1b10>: Failed to establish a new connection: [Errno 111] Connection refused'))
I tried to set the GOOGLE_CLOUD_PROJECT as shown below:
%env GOOGLE_CLOUD_PROJECT=gcp-test
But I am still getting the same 'connection refused' error as shown below (only the project ID warning got resolved):
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: [Errno 111] Connection refused
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b322f6d1b10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Can anyone suggest how I can use a google-cloud library in a Databricks notebook, or how I can resolve this 'connection refused' error in the Databricks notebook?
#Bigquery #GoogleCloud
Monday
Hi @soumiknow,
To resolve the 'connection refused' error when using the google-cloud-bigquery library in your Databricks notebook, you need to ensure that your Databricks cluster is properly configured to authenticate with Google Cloud Platform (GCP). Here are the steps you can follow:
Set the Cluster Spark Config (pointing to your service account JSON key uploaded to DBFS):
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile /dbfs/<path-to-json-key>
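Before setting the environment variable, a quick sanity check can confirm the key file is actually readable from the driver (a sketch only; the path is the same placeholder as above):
import json
# Confirm the service-account JSON key uploaded to DBFS is readable from the driver.
with open("/dbfs/<path-to-json-key>") as f:  # replace with your actual key path
    key = json.load(f)
print(key.get("client_email"), key.get("project_id"))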
Set the Environment Variable:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/dbfs/<path-to-json-key>"
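Alternatively, instead of relying on the environment variable, you can pass explicit credentials to the client (a minimal sketch using the google-auth service-account API; the key path is the same placeholder as above):
from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials explicitly from the key file rather than relying on
# Application Default Credentials and the metadata server.
credentials = service_account.Credentials.from_service_account_file("/dbfs/<path-to-json-key>")
client = bigquery.Client(project="gcp-test", credentials=credentials)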
Run Your BigQuery Code:
from google.cloud import bigquery
gcp_project_id = "gcp-test"
drop_partition_query = f""" DELETE FROM `gcp-test.test-dataset.test-table` WHERE partition_date = '2022-01-22' """
print(drop_partition_query)
client = bigquery.Client(project=gcp_project_id)
result = client.query(drop_partition_query)
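Note that client.query() submits the job asynchronously; a small follow-up sketch waits for the DELETE to finish and reports how many rows were removed via the QueryJob API:
job = client.query(drop_partition_query)
job.result()  # block until the DML job completes
print(f"Rows deleted: {job.num_dml_affected_rows}")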
Please review this documentation: https://docs.databricks.com/en/connect/external-systems/bigquery.html
Monday - last edited Monday
Hi @soumiknow
Thank you for sharing the detailed context! To better understand and address the issue, let's start by clarifying a few points. Are you able to successfully query BigQuery using the default spark-bigquery-connector library, even with the 0.22.2-SNAPSHOT version? This will help confirm whether the connectivity (network/authorization) itself is working or if there's a broader issue with the Databricks setup. If there are any network restrictions, you'll need to ensure all traffic to Google Cloud endpoints is allowed in your networking setup.
Let me know. In the meantime, you may run these troubleshooting steps from a notebook:
https://cloud.google.com/compute/docs/troubleshooting/troubleshoot-metadata-server#rest
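For example, a minimal connectivity check against the metadata server can be run from a notebook cell (the endpoint and Metadata-Flavor header follow the linked troubleshooting page):
import requests

# Try to reach the GCE metadata server that google-auth falls back to
# when no explicit credentials are configured.
try:
    resp = requests.get(
        "http://metadata.google.internal/computeMetadata/v1/project/project-id",
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    )
    print(resp.status_code, resp.text)
except requests.exceptions.ConnectionError as exc:
    print("Metadata server unreachable:", exc)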
Monday
Hello @Alberto_Umana & @VZLA ,
Thanks both of you for your kind suggestions.
I was able to query BQ using the spark-bigquery-connector with the 0.22.2-SNAPSHOT version. The problem started when I tried to set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC'; I already explained the issue in the community post. Removing the 0.22.2-SNAPSHOT version via an init script worked, but the expected version 0.41.0 still did not work after installing it. So I chose the google-cloud-bigquery library to perform the partitioned data overwrite operation.
I was able to resolve the 'connection refused' error by setting the 'Access Mode' to 'No isolation shared' while creating the cluster. Now every BQ query returns the expected output through the google-cloud-bigquery client. Let me know if there is any concern over choosing the 'No isolation shared' access mode.
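For reference, a minimal sketch of the partition-overwrite flow described above, assuming the 'No isolation shared' access mode, that df is the DataFrame holding the replacement rows, and that the table, date, and bucket names are placeholders:
from google.cloud import bigquery

client = bigquery.Client(project="gcp-test")

# 1. Drop the target partition with a DML DELETE through the BigQuery client ...
job = client.query(
    "DELETE FROM `gcp-test.test-dataset.test-table` "
    "WHERE partition_date = '2022-01-22'"
)
job.result()  # wait for the DELETE to finish

# 2. ... then append the replacement rows, here via the default spark-bigquery-connector.
(df.write.format("bigquery")
    .option("table", "gcp-test.test-dataset.test-table")
    .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
    .mode("append")
    .save())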
yesterday - last edited yesterday
Hi @soumiknow ,
Glad to hear it's been resolved, and I really appreciate you sharing the solution that worked for you. Is it fair to say that it fails with any other cluster access mode and DBR version? I'm interested to know what the relation is between the access mode and the 'connection refused' error. Was it originally a Spark Connect based cluster by any chance (Shared mode)?
yesterday
Hi @VZLA ,
Yes, it is failing with other access modes (Shared) and the same 15.4 LTS version.
I am not quite sure about the relation between the access mode and the connection, but it works. I tried all the possible combinations and it finally worked with the 'No isolation shared' access mode.
yesterday
I see. So maybe the session isolation is causing issues with library compatibility or networking; it is still interesting to understand why this happens only with newer library versions. It might be due to classloading restrictions in isolated sessions preventing the connector from loading its dependencies properly, or to the networking configuration for Spark Connect, which might introduce connection issues such as the 'connection refused' error.
If you'd like to investigate this further, I'd suggest raising a support ticket for closer collaboration on root-causing this behavior.
Thanks for your feedback!
yesterday
Thanks @VZLA for your inputs.
I already raised a support ticket (00583784). We'll see how it goes.
yesterday
@soumiknow could you please check if you're getting the case updates? It seems the engineer has sent you an email, due to some issues with the case communication updates.
yesterday
Hi @VZLA ,
Yes, the updates were not coming to me. I asked them to send an email update, which I received yesterday. I have yet to respond due to an admin access issue.