The issue you are facing arises when using mode("overwrite") with Spark to load data into BigQuery: the error indicates that BigQuery expects a STRING type for the source column but is being handed a STRUCT type during the overwrite. Oddly, the same data loads fine in append mode, and the temporary table also shows the correct STRING type for source.
Key Points from Your Case
- Append mode works: The data writes as expected and BigQuery's schema keeps source as STRING.
- Overwrite mode fails: Exception about a STRUCT vs STRING mismatch, even when the data appears correctly formatted.
- DBR version and connector: Databricks DBR 16.1 with the spark-3.5-bigquery-0.41.0 connector.
- Schema inference: Spark infers schemas for columns unless one is explicitly provided.
- Temporary tables during overwrite: Overwrite operations sometimes create or rely on temp tables and schema-merging logic, which can cause mismatches if the source DataFrame and the target table schema don't align perfectly.
Why the Issue Occurs
When using overwrite, Spark replaces the table's data and often needs to reconcile the schema between your DataFrame and the existing BigQuery table. If your DataFrame column (e.g., source) contains nested fields or only nulls, Spark/BigQuery may interpret it as STRUCT, especially if any row contains a dictionary/object instead of a plain string (a quick way to check this is sketched after the list below).
- In append mode, BigQuery applies type conversion based on incoming data; new tables get STRING columns because your DataFrame's DDL shows a STRING type.
- In overwrite mode, schema merging can be stricter or misinterpret ambiguous types (e.g., all-null columns or a mix of types), leading to the STRUCT vs STRING mismatch.
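A minimal sketch of that check, assuming your DataFrame is named df and the column is literally called source (adjust both names to your case):

```python
from pyspark.sql.types import StringType, StructType

# Inspect the type Spark has inferred for the column before it reaches BigQuery.
source_type = df.schema["source"].dataType

if isinstance(source_type, StructType):
    # Any row carrying a dict/object (or an all-null column read as a struct)
    # will surface here, long before the overwrite fails server-side.
    print("source was inferred as STRUCT:", source_type.simpleString())
elif not isinstance(source_type, StringType):
    print("source is not STRING:", source_type.simpleString())
else:
    print("source is STRING, as expected")
```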
Solutions & Workarounds
1. Explicitly Cast Columns in DataFrame
Before writing, make sure all columns match the expected types:
```python
from pyspark.sql.functions import col

# Force the column to STRING so the connector never sees an inferred STRUCT.
df = df.withColumn("source", col("source").cast("string"))
```
- This ensures Spark sees source as STRING regardless of any ambiguous values; a full write call with the cast applied is sketched below.
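Here is a hedged sketch of that write call; the project, dataset, table, and bucket names are placeholders rather than your actual configuration, and temporaryGcsBucket is only needed for the indirect write method:

```python
from pyspark.sql.functions import col

# Cast first, then write; all identifiers below are placeholders.
df_out = df.withColumn("source", col("source").cast("string"))

(
    df_out.write.format("bigquery")
    .option("table", "my_project.my_dataset.my_table")  # placeholder table ID
    .option("temporaryGcsBucket", "my-temp-bucket")      # only for indirect writes
    .mode("overwrite")
    .save()
)
```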
2. Set Schema Explicitly in BigQuery
If possible, create your BigQuery table with an explicit schema before overwriting, so that every column already has the correct type.
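A minimal sketch with the google-cloud-bigquery client, assuming a hypothetical project/dataset/table ID and that source is the column you need to pin; extend the schema list to cover your remaining columns:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

# Declare the columns explicitly, so BigQuery never has to guess the type
# of `source` from incoming data.
schema = [
    bigquery.SchemaField("source", "STRING"),
    # ... add the rest of your columns here ...
]

table = bigquery.Table("my-gcp-project.my_dataset.my_table", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists
```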
3. Drop and Recreate Table Before Overwrite
Since BigQuery’s overwrite does not always fully drop old schema metadata (especially with partitioned tables), drop the table before writing:
```python
from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)

# Remove the old table (and its schema) entirely; not_found_ok avoids an
# error if the table does not exist yet.
client.delete_table(bq_table_name, not_found_ok=True)

# Now write with overwrite mode as usual
```
This forces Spark to create a new table with the schema inferred from your DataFrame.
4. Check DataFrame for Any STRUCTs
Print your DataFrame schema just before writing:
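For example (assuming the DataFrame variable is named df):

```python
# Show the full inferred schema, including any nested STRUCT fields.
df.printSchema()
```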
If you see a StructType for source, there is a data issue; make sure the source column always holds plain string values.
5. Connector Compatibility
Ensure the jar and DBR versions are compatible. Sometimes, subtle connector changes affect schema handling in overwrite mode. Consider upgrading the connector if possible.
Conclusion
- The problem arises from differences in schema type handling between append and overwrite modes in the Spark-BigQuery connector.
- Explicitly casting columns and ensuring the table schema matches the DataFrame schema before the overwrite are the most reliable fixes.
- Dropping the table beforehand can prevent legacy schema mismatches, especially in overwrite flows.
Always test with a small batch to verify that the problem is resolved. If the error persists, dig further into the raw DataFrame contents or the connector settings.