Sunday - last edited Sunday
I want to use the google-cloud-bigquery library in my PySpark code, even though I know the spark-bigquery-connector is available. The reason is that Databricks Runtime 15.4 LTS ships with the 0.22.2-SNAPSHOT version of the spark-bigquery-connector, which is not a recommended working version as per the documentation. They suggest using version 0.41.0 of the spark-bigquery-connector jar in the Databricks cluster to leverage all the features.
Since I am unable to use the 0.41.0 jar on the Databricks 15.4 LTS cluster, I want to use the google-cloud-bigquery library.
The problem I am facing with the code below is:
from google.cloud import bigquery
gcp_project_id = "gcp-test"
drop_partition_query = f""" DELETE FROM `gcp-test.test-dataset.test-table` WHERE partition_date = '2022-01-22' """
print(drop_partition_query)
client = bigquery.Client(project=f"{gcp_project_id}")
result = client.query(drop_partition_query)
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: [Errno 111] Connection refused
WARNING:google.auth._default:No project ID could be determined. Consider running `gcloud config set project` or setting the GOOGLE_CLOUD_PROJECT environment variable
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b322f6d1b10>: Failed to establish a new connection: [Errno 111] Connection refused'))
I tried to set the GOOGLE_CLOUD_PROJECT as shown below:
%env GOOGLE_CLOUD_PROJECT=gcp-test
But I am still getting the same 'connection refused' error as shown below (only the project ID warning got resolved):
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: [Errno 111] Connection refused
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/universe/universe_domain (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7b322f6d1b10>: Failed to establish a new connection: [Errno 111] Connection refused'))
Can anyone suggest how I can use a google-cloud library in a Databricks notebook, or how I can resolve this 'connection refused' error in the Databricks notebook?
#Bigquery #GoogleCloud
Monday
Hi @soumiknow,
To resolve the 'connection refused' error when using the google-cloud-bigquery library in your Databricks notebook, you need to ensure that your Databricks cluster is properly configured to authenticate with Google Cloud Platform (GCP). Here are the steps you can follow:
Set the Cluster Spark Config (pointing to your service account JSON key uploaded to DBFS):
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile /dbfs/<path-to-json-key>
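Before setting the environment variable, a quick sanity check can confirm the key file is actually readable from the driver (a sketch only; the path is the same placeholder as above):
import json
# Confirm the service-account JSON key uploaded to DBFS is readable from the driver.
with open("/dbfs/<path-to-json-key>") as f:  # replace with your actual key path
    key = json.load(f)
print(key.get("client_email"), key.get("project_id"))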
Set the Environment Variable:
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/dbfs/<path-to-json-key>"
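Alternatively, instead of relying on the environment variable, you can pass explicit credentials to the client (a minimal sketch using the google-auth service-account API; the key path is the same placeholder as above):
from google.cloud import bigquery
from google.oauth2 import service_account

# Build credentials explicitly from the key file rather than relying on
# Application Default Credentials and the metadata server.
credentials = service_account.Credentials.from_service_account_file("/dbfs/<path-to-json-key>")
client = bigquery.Client(project="gcp-test", credentials=credentials)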
Run Your BigQuery Code:
from google.cloud import bigquery
gcp_project_id = "gcp-test"
drop_partition_query = f""" DELETE FROM `gcp-test.test-dataset.test-table` WHERE partition_date = '2022-01-22' """
print(drop_partition_query)
client = bigquery.Client(project=gcp_project_id)
result = client.query(drop_partition_query)
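Note that client.query() submits the job asynchronously; a small follow-up sketch waits for the DELETE to finish and reports how many rows were removed via the QueryJob API:
job = client.query(drop_partition_query)
job.result()  # block until the DML job completes
print(f"Rows deleted: {job.num_dml_affected_rows}")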
Please review this documentation: https://docs.databricks.com/en/connect/external-systems/bigquery.html
Monday - last edited Monday
Hi @soumiknow
Thank you for sharing the detailed context! To better understand and address the issue, let's start by clarifying a few points. Are you able to successfully query BigQuery using the default spark-bigquery-connector library, even with the 0.22.2-SNAPSHOT version? This will help confirm whether the connectivity (network/authorization) itself is working or if there's a broader issue with the Databricks setup. If there are any network restrictions, you'll need to ensure all traffic to Google Cloud endpoints is allowed in your networking setup.
Let me know. In the meantime, you may run these troubleshooting steps from a notebook:
https://cloud.google.com/compute/docs/troubleshooting/troubleshoot-metadata-server#rest
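For example, a minimal connectivity check against the metadata server can be run from a notebook cell (the endpoint and Metadata-Flavor header follow the linked troubleshooting page):
import requests

# Try to reach the GCE metadata server that google-auth falls back to
# when no explicit credentials are configured.
try:
    resp = requests.get(
        "http://metadata.google.internal/computeMetadata/v1/project/project-id",
        headers={"Metadata-Flavor": "Google"},
        timeout=5,
    )
    print(resp.status_code, resp.text)
except requests.exceptions.ConnectionError as exc:
    print("Metadata server unreachable:", exc)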
Monday
Hello @Alberto_Umana & @VZLA ,
Thanks both of you for your kind suggestions.
I was able to query BQ using the spark-bigquery-connector with the 0.22.2-SNAPSHOT version. The problem started when I tried to set 'spark.sql.sources.partitionOverwriteMode' to 'DYNAMIC'; I already explained the issue in the community post. Removing the 0.22.2-SNAPSHOT version via an init script worked, but the expected version 0.41.0 still did not work after installing it. So I chose the google-cloud-bigquery library to perform the partitioned data overwrite operation.
I was able to resolve the 'connection refused' error by setting the 'Access Mode' to 'No isolation shared' while creating the cluster. Now every BQ query returns the expected output through the google-cloud-bigquery client. Let me know if there is any concern over choosing the 'No isolation shared' access mode.
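For reference, a minimal sketch of the partition-overwrite flow described above, assuming the 'No isolation shared' access mode, that df is the DataFrame holding the replacement rows, and that the table, date, and bucket names are placeholders:
from google.cloud import bigquery

client = bigquery.Client(project="gcp-test")

# 1. Drop the target partition with a DML DELETE through the BigQuery client ...
job = client.query(
    "DELETE FROM `gcp-test.test-dataset.test-table` "
    "WHERE partition_date = '2022-01-22'"
)
job.result()  # wait for the DELETE to finish

# 2. ... then append the replacement rows, here via the default spark-bigquery-connector.
(df.write.format("bigquery")
    .option("table", "gcp-test.test-dataset.test-table")
    .option("temporaryGcsBucket", "my-temp-bucket")  # placeholder bucket
    .mode("append")
    .save())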
yesterday - last edited yesterday
Hi @soumiknow ,
Glad to hear it's been resolved, and I really appreciate you sharing the solution that worked for you. Is it fair to say that it fails with any other cluster access mode and DBR version? I'm interested to know what the relation is between the access mode and the 'connection refused' error. Was it originally a Spark Connect based cluster by any chance (Shared mode)?
yesterday
Hi @VZLA ,
Yes, it is failing with other access modes (Shared) and the same 15.4 LTS version.
I am not quite sure about the relation between the access mode and the connection, but it works. I tried all the possible combinations and it finally worked with the 'No isolation shared' access mode.
yesterday
I see. So maybe the session isolation is causing issues with library compatibility or networking; it is still interesting to understand why this happens only with newer library versions. It might be due to classloading restrictions in isolated sessions preventing the connector from loading its dependencies properly, or to the networking configuration for Spark Connect, which might introduce connection issues such as the 'connection refused' error.
If you'd like to investigate this further, I'd suggest raising a support ticket for closer collaboration on root-causing this behavior.
Thanks for your feedback!
yesterday
Thanks @VZLA for your inputs.
I already raised a support ticket (00583784). We'll see how it goes.
yesterday
@soumiknow could you please check if you're getting the case updates? It seems the engineer has sent you an email, due to some issues with the case communication updates.
yesterday
Hi @VZLA ,
Yes, the updates were not coming to me. I asked them to send an email update, which I received yesterday. I have yet to respond due to an admin access issue.