I am using Databricks Runtime 15.4 (Spark 3.5 / Scala 2.12) on AWS.
My goal is to use the latest Google BigQuery connector because I need the direct write method (BigQuery Storage Write API):
option("writeMethod", "direct")This allows writing directly into BigQuery without requiring a temporary GCS bucket, which is necessary in my environment.
To do this, I installed the official Google connector as a cluster library via Maven:
`com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.43.1`
The library installs successfully and shows as "Attached" on the cluster.
However, Databricks does not use this connector at runtime. To check which connector is actually being loaded, I run:
```python
# Instantiate the provider and ask the JVM which jar its class was loaded from
jvm = spark._jvm
provider = jvm.com.google.cloud.spark.bigquery.BigQueryRelationProvider()
location = provider.getClass().getProtectionDomain().getCodeSource().getLocation().toString()
print(location)
```
The output is always:

```
...spark-bigquery-connector-hive-2.3__hadoop-3.2_2.12--fatJar-assembly-0.22.2-SNAPSHOT.jar
```
This means Databricks always loads its built-in forked connector (0.22.2-SNAPSHOT) instead of the Google connector (0.43.x) that I installed.
Additional observations:
- Restarting the cluster changes nothing.
- The installed connector shows as "Attached" but never appears in `/databricks/jars`.
- `/databricks/jars` contains only:
  - `spark-bigquery-connector-hive-2.3__hadoop-3.2_2.12--fatJar-assembly-0.22.2-SNAPSHOT.jar`
  - `spark-bigquery-with-dependencies_2.12-0.41.0.jar` (Databricks' own copy)
- `spark.read.format("bigquery")` still resolves to the built-in connector every time.
Question: Is there any supported way on Databricks Runtime 15.4 to override or replace the built-in BigQuery connector so that:
spark.read.format("bigquery")uses the Google spark-bigquery-with-dependencies_2.12 (0.43.x) connector, specifically to allow using the direct write method without a temporary GCS bucket?
Or is the Databricks BigQuery connector version fixed and not user-overridable?