Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

data not inserting in 'overwrite' mode - Value has type STRUCT which cannot be inserted into column

soumiknow
Contributor II

We have the following code, which we use to load data into a BigQuery table after reading parquet files from Azure Data Lake Storage:

df.write.format("bigquery") \
    .option("parentProject", gcp_project_id) \
    .option("table", f"{bq_table_name}") \
    .option("temporaryGcsBucket", f"{temp_gcs_bucket}") \
    .option("spark.sql.sources.partitionOverwriteMode", "DYNAMIC") \
    .option("writeMethod", "indirect") \
    .mode("overwrite") \
    .save()

This code was working, but since last week we have been getting the following exception:

bigquery.storageapi.shaded.com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Query error: Value has type STRUCT which cannot be inserted into column source, which has type STRING at [7:2393]

The exception clearly says that the incoming data has a STRUCT data type instead of STRING, but we cross-checked the data and it is not a STRUCT at all.

However, when we changed the mode from 'overwrite' to 'append', the same data got loaded successfully.

To re-confirm the incoming data type, we loaded the 'source' column data into a temporary table that gets created automatically when we provide only the dataset name in 'append' mode. In the newly created table, the same 'source' column data loaded with the STRING data type.
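
A minimal sketch of that check (bq_dataset_name and tmp_source_check are placeholder names; the other variables are the same as in the write above):

df.select("source").write.format("bigquery") \
    .option("parentProject", gcp_project_id) \
    .option("table", f"{bq_dataset_name}.tmp_source_check") \
    .option("temporaryGcsBucket", f"{temp_gcs_bucket}") \
    .option("writeMethod", "indirect") \
    .mode("append") \
    .save()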

So, the issue is happening only in 'overwrite' mode.

We are using DBR 16.1 and spark-3.5-bigquery-0.41.0.jar.

Please suggest how to overcome this issue in 'overwrite' mode.

1 REPLY

mark_ott
Databricks Employee

The issue you are facing arises when using mode("overwrite") with Spark to load data into BigQuery—the error indicates BigQuery expects a STRING type for the source column, but it is being supplied a STRUCT type during overwrite operations. Strangely, the same data loads fine in append mode and the temporary table shows the correct STRING type for source as well.

Key Points from Your Case

  • Append mode works: The data writes as expected and BigQuery's schema keeps source as STRING.

  • Overwrite mode fails: Exception about STRUCT vs STRING mismatch, even when data appears correctly formatted.

  • DBR version and connector: Using Databricks DBR 16.1 and spark-3.5-bigquery-0.41.0 connector.

  • Schema inference: Spark infers schemas for columns unless explicitly provided.

  • Temporary tables during overwrite: Overwrite operations sometimes create/rely on temp tables or rely on schema merging logic, which can cause mismatches if the source DataFrame and the target table schema don't align perfectly.


Why the Issue Occurs

When using overwrite, Spark replaces the table's data and often needs to reconcile the schema between your DataFrame and the existing BigQuery table schema. If your DataFrame column (e.g., source) has nested fields or is null, Spark/BigQuery may interpret this as STRUCT, especially if any row contains a dictionary/object instead of a plain string (see the sketch after the bullets below).

  • In append mode, BigQuery applies type conversion based on incoming data; new tables get string columns because your DataFrame’s DDL shows STRING type.

  • In overwrite mode, schema merging can be stricter or misinterpret ambiguous types (e.g., all-null columns or mix of types), leading to a STRUCT-STRING mismatch.
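
To make the inference point concrete, here is a minimal, hypothetical sketch (not taken from your pipeline) of how a nested record makes Spark infer a StructType for the whole column:

python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: an upstream record carries a nested object in "source"
demo = spark.createDataFrame([Row(source=Row(system="crm", id="42"))])
demo.printSchema()  # "source" comes out as struct<system:string,id:string>, not string

The connector sends whatever the DataFrame schema says, so a nested value upstream is enough to turn the column into a STRUCT from BigQuery's point of view, regardless of how the individual values look.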


Solutions & Workarounds

1. Explicitly Cast Columns in DataFrame

Before writing, make sure all columns match the expected types:

python
from pyspark.sql.functions import col

df = df.withColumn("source", col("source").cast("string"))
  • This ensures Spark sees source as STRING regardless of any ambiguous values. If the column can genuinely arrive as a nested type from some files, see the flattening sketch just below.
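
A hedged alternative for that case (assuming only that your DataFrame is called df, as in your code) is to serialize the struct to a JSON string instead of casting:

python
from pyspark.sql.functions import col, to_json
from pyspark.sql.types import StructType

# Only rewrite the column when this DataFrame's schema really has it as a struct
if isinstance(df.schema["source"].dataType, StructType):
    df = df.withColumn("source", to_json(col("source")))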

2. Set Schema Explicitly in BigQuery

If possible, create your BigQuery table with an explicit schema before overwrite, ensuring all columns have correct types.
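
A sketch of that with the BigQuery Python client is below; it assumes bq_table_name is a dataset.table id (as in your write options) and the field list is only illustrative, so replace it with your full schema:

python
from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)

# Illustrative schema: list every column with its intended BigQuery type
schema = [
    bigquery.SchemaField("source", "STRING"),
]

table = bigquery.Table(f"{gcp_project_id}.{bq_table_name}", schema=schema)
client.create_table(table, exists_ok=True)  # no-op if the table already exists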

3. Drop and Recreate Table Before Overwrite

Since BigQuery’s overwrite does not always fully drop old schema metadata (especially with partitioned tables), drop the table before writing:

python
from google.cloud import bigquery

client = bigquery.Client(project=gcp_project_id)
client.delete_table(bq_table_name, not_found_ok=True)

# Now write with overwrite mode as usual

This forces Spark to create a new table with the schema inferred from your DataFrame.

4. Check DataFrame for Any STRUCTs

Print your DataFrame schema just before writing:

python
df.printSchema()

If you see StructType for source, there is a data issue—ensure the source column always holds string values.
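
To make that check harder to miss, a small guard (a sketch, assuming the DataFrame is df) can fail fast before the write instead of relying on reading the printed schema:

python
from pyspark.sql.types import StructType

# Columns that would reach BigQuery as STRUCT instead of a scalar type
struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
if struct_cols:
    raise ValueError(f"Unexpected struct columns before write: {struct_cols}")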

5. Connector Compatibility

Ensure the jar and DBR versions are compatible. Sometimes, subtle connector changes affect schema handling in overwrite mode. Consider upgrading the connector if possible.
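
The jar name you mention suggests the Maven coordinate com.google.cloud.spark:spark-3.5-bigquery:0.41.0. On Databricks you would normally attach the jar or Maven coordinate as a cluster library, but for completeness, a sketch of pinning it on a self-managed Spark session looks like this (the version shown is just the one from your message; check the connector releases for newer builds):

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pin an explicit connector build instead of relying on whatever is on the classpath
    .config("spark.jars.packages", "com.google.cloud.spark:spark-3.5-bigquery:0.41.0")
    .getOrCreate()
)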


Conclusion

  • The problem arises from differences in schema type handling between append and overwrite modes in the Spark-BigQuery connector.

  • Explicitly casting columns and ensuring table schema matches the DataFrame schema before overwrite are the most reliable fixes.

  • Dropping the table beforehand can prevent legacy schema mismatches, especially in overwrite flows.

Always test with a small batch to verify whether the problem is resolved. If the error persists, further debug the raw DataFrame contents or connector settings.
