The issue you are facing arises when using mode("overwrite") with Spark to load data into BigQuery: the error indicates that BigQuery expects a STRING type for the source column but is being handed a STRUCT type during the overwrite. Oddly, the same data loads fine in append mode, and the temporary table also shows the correct STRING type for source.
Key Points from Your Case
- Append mode works: The data writes as expected and BigQuery's schema keeps source as STRING.
- Overwrite mode fails: Exception about a STRUCT vs STRING mismatch, even when the data appears correctly formatted.
- DBR version and connector: Databricks DBR 16.1 with the spark-3.5-bigquery-0.41.0 connector.
- Schema inference: Spark infers schemas for columns unless one is explicitly provided.
- Temporary tables during overwrite: Overwrite operations sometimes create or rely on temp tables and schema-merging logic, which can cause mismatches if the source DataFrame and the target table schema don't align perfectly.
Why the Issue Occurs
When using overwrite, Spark replaces the table's data and often needs to reconcile the schema between your DataFrame and the existing BigQuery table. If your DataFrame column (e.g., source) contains nested fields or only nulls, Spark/BigQuery may interpret it as STRUCT, especially if any row contains a dictionary/object instead of a plain string (a quick way to check this is sketched after the list below).
- In append mode, BigQuery applies type conversion based on incoming data; new tables get STRING columns because your DataFrame's DDL shows a STRING type.
- In overwrite mode, schema merging can be stricter or misinterpret ambiguous types (e.g., all-null columns or a mix of types), leading to the STRUCT vs STRING mismatch.
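A minimal sketch of that check, assuming your DataFrame is named df and the column is literally called source (adjust both names to your case):

```python
from pyspark.sql.types import StringType, StructType

# Inspect the type Spark has inferred for the column before it reaches BigQuery.
source_type = df.schema["source"].dataType

if isinstance(source_type, StructType):
    # Any row carrying a dict/object (or an all-null column read as a struct)
    # will surface here, long before the overwrite fails server-side.
    print("source was inferred as STRUCT:", source_type.simpleString())
elif not isinstance(source_type, StringType):
    print("source is not STRING:", source_type.simpleString())
else:
    print("source is STRING, as expected")
```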
Solutions & Workarounds
1. Explicitly Cast Columns in DataFrame
Before writing, make sure all columns match the expected types:
```python
from pyspark.sql.functions import col

# Force the column to STRING so the connector never sees an inferred STRUCT.
df = df.withColumn("source", col("source").cast("string"))
```
- This ensures Spark sees source as STRING regardless of any ambiguous values; a full write call with the cast applied is sketched below.
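Here is a hedged sketch of that write call; the project, dataset, table, and bucket names are placeholders rather than your actual configuration, and temporaryGcsBucket is only needed for the indirect write method:

```python
from pyspark.sql.functions import col

# Cast first, then write; all identifiers below are placeholders.
df_out = df.withColumn("source", col("source").cast("string"))

(
    df_out.write.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")  # placeholder table ID
    .option("temporaryGcsBucket", "my-temp-bucket")      # only for indirect writes
    .mode("overwrite")
    .save()
)
```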
2. Set Schema Explicitly in BigQuery
If possible, create your BigQuery table with an explicit schema before overwriting, so that every column already has the correct type.
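A minimal sketch with the google-cloud-bigquery client, assuming a hypothetical project/dataset/table ID and that source is the column you need to pin; extend the schema list to cover your remaining columns:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Declare the columns explicitly, so BigQuery never has to guess the type
# of `source` from incoming data.
schema = [
    bigquery.SchemaField("source", "STRING"),
    # ... add the rest of your columns here ...
]

table = bigquery.Table("my-gcp-project.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists
```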
3. Drop and Recreate Table Before Overwrite
Since BigQuery’s overwrite does not always fully drop old schema metadata (especially with partitioned tables), drop the table before writing:
```python
from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)

# Remove the old table (and its schema) entirely; not_found_ok avoids an
# error if the table does not exist yet.
client.delete_table(bq_table_name, not_found_ok=True)

# Now write with overwrite mode as usual
```
This forces Spark to create a new table with the schema inferred from your DataFrame.
4. Check DataFrame for Any STRUCTs
Print your DataFrame schema just before writing:
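For example (assuming the DataFrame variable is named df):

```python
# Show the full inferred schema, including any nested STRUCT fields.
df.printSchema()
```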
If you see a StructType for source, there is a data issue; make sure the source column always holds plain string values.
5. Connector Compatibility
Ensure the jar and DBR versions are compatible. Sometimes, subtle connector changes affect schema handling in overwrite mode. Consider upgrading the connector if possible.
Conclusion
- The problem arises from differences in schema type handling between append and overwrite modes in the Spark-BigQuery connector.
- Explicitly casting columns and ensuring the table schema matches the DataFrame schema before the overwrite are the most reliable fixes.
- Dropping the table beforehand can prevent legacy schema mismatches, especially in overwrite flows.
Always test with a small batch to verify that the problem is resolved. If the error persists, dig further into the raw DataFrame contents or the connector settings.