Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Inconsistent UNRESOLVED_COLUMN._metadata error on Serverless compute during MERGE operations

ruicarvalho_de
New Contributor III

Hi.

I've been facing this problem for the last couple of days.

We're experiencing intermittent failures with the error [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name '_metadata' cannot be resolved. SQLSTATE: 42703 when running MERGE operations on Serverless compute. The same code works consistently on job clusters.

I already tried the delta.enableRowTracking workaround suggested in this thread, without success: https://community.databricks.com/t5/get-started-discussions/cannot-run-merge-statement-in-the-notebo...

Context:
Our ingestion pipeline reads Parquet files from a landing zone and merges them into Delta raw tables. We use the _metadata.file_path virtual column to track source files in a Sys_SourceFile column.

Code Pattern:

from pyspark.sql.functions import col

# Read parquet
df_landing = spark.read.format('parquet').load(landing_path)

# Add system columns including Sys_SourceFile from _metadata
df = df_landing.withColumn('Sys_SourceFile', col('_metadata.file_path'))

# Create temp view
df.createOrReplaceTempView('landing_data')

# Execute MERGE
spark.sql("""
    MERGE INTO target_table AS raw
    USING landing_data AS landing
    ON landing.pk = raw.pk
    WHEN MATCHED AND landing.Sys_Hash != raw.Sys_Hash
    THEN UPDATE SET ...
    WHEN NOT MATCHED BY TARGET
    THEN INSERT ...
""")

 

Testing & Findings:

_metadata is available after reading to df_landing.

_metadata is available inside the function that adds system columns.

Same code, same parameters, different results:

  • Table A - Fails on Serverless

  • Table B, with the same config, works on Serverless

  • Both tables have identical delta.enableRowTracking = true

  • Both use the same code path

Job Cluster: All tables work consistently.

delta.enableRowTracking: found the community post above suggesting this property causes the issue, but we have tables with enableRowTracking = true that work fine on Serverless, while others with the same property fail.

Key Observations:

  • The _metadata virtual column is available at the DataFrame level but gets "lost" somewhere in the execution plan when passed through createOrReplaceTempView() to SQL MERGE.

  • The error only manifests at MERGE execution time, not when adding the column with withColumn()

  • Behavior is non-deterministic: same code, same config, different tables, different results

  • Serverless uses Spark Connect, which "defers analysis and name resolution to execution time". This seems related, but doesn't explain the inconsistency

Is there a workaround for this, and a solid explanation of why it happens?

Rui Carvalho
Data Engineer

SteveOstrowski
Databricks Employee

Hi @ruicarvalho_de,

The behavior you are seeing is related to how the _metadata virtual column interacts with Spark Connect (which Serverless compute uses under the hood) and how deferred analysis changes when that hidden column is resolved.

Here is what is happening and how to work around it.


WHY THIS HAPPENS

The _metadata column is a hidden virtual column that Spark injects at read time for file-based data sources. It is not a physical column in the DataFrame schema. When you call createOrReplaceTempView(), the logical plan is serialized and later re-analyzed at execution time. On Serverless compute (Spark Connect), analysis and name resolution are deferred to execution time rather than being eagerly resolved at definition time like on classic Job Clusters.

During a MERGE operation, the query planner needs to resolve columns from both the source (your temp view) and the target (the Delta table). Here is where the conflict arises: Delta tables with delta.enableRowTracking = true also expose their own _metadata struct (containing row_id and row_commit_version). When the MERGE planner encounters _metadata in the execution plan, the deferred resolution on Serverless can become ambiguous or fail to resolve the file-source _metadata, depending on internal plan optimization order. This explains the non-deterministic behavior you observed, where some tables work and others do not despite identical configurations.
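A quick way to compare the failing and working tables is to inspect the row-tracking property directly ('target_table' is a placeholder; run this against each table):

```sql
-- 'target_table' is a placeholder; check both failing and working tables.
SHOW TBLPROPERTIES target_table ('delta.enableRowTracking');
```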


THE WORKAROUND

The fix is to materialize the _metadata fields you need into regular, physical columns before creating the temp view. Once you do this, there is no hidden virtual column for the deferred analysis to lose track of. You are already doing this partially with your withColumn('Sys_SourceFile', col('_metadata.file_path')) call, but the key is to also drop the _metadata column (or simply select only the columns you need) before creating the temp view.

Updated code pattern:

# Read parquet
df_landing = spark.read.format('parquet').load(landing_path)

# Add system columns from _metadata FIRST
df = df_landing.withColumn('Sys_SourceFile', col('_metadata.file_path'))

# Select only the columns you need (excluding _metadata)
columns_to_keep = [c for c in df.columns if c != '_metadata']
df_clean = df.select(*columns_to_keep)

# Create temp view from the cleaned DataFrame
df_clean.createOrReplaceTempView('landing_data')

# Execute MERGE - now works consistently on Serverless
spark.sql("""
MERGE INTO target_table AS raw
USING landing_data AS landing
ON landing.pk = raw.pk
WHEN MATCHED AND landing.Sys_Hash != raw.Sys_Hash
THEN UPDATE SET ...
WHEN NOT MATCHED BY TARGET
THEN INSERT ...
""")

By explicitly selecting only the columns you need (which now includes your materialized Sys_SourceFile), the _metadata virtual column is no longer part of the logical plan that gets serialized into the temp view. This removes the ambiguity during deferred resolution on Serverless.
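The column filter itself is plain Python, so it can be sanity-checked outside Spark. The column names below are illustrative; '_metadata' appears in the list once the virtual column has been referenced in the plan:

```python
# Illustrative column list for a landing DataFrame that has referenced _metadata.
columns = ['pk', 'Sys_Hash', 'Sys_SourceFile', '_metadata']

# Same filter as in the workaround above: keep everything except '_metadata'.
columns_to_keep = [c for c in columns if c != '_metadata']
print(columns_to_keep)  # → ['pk', 'Sys_Hash', 'Sys_SourceFile']
```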


ALTERNATIVE APPROACH: AVOID TEMP VIEWS ENTIRELY

Another option is to skip createOrReplaceTempView() and use the DataFrame API for the MERGE instead. The DeltaTable API keeps the plan in the DataFrame domain where _metadata resolution is more predictable:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, 'target_table')

target.alias('raw').merge(
    df.alias('landing'),
    'landing.pk = raw.pk'
).whenMatchedUpdate(
    condition='landing.Sys_Hash != raw.Sys_Hash',
    set={...}
).whenNotMatchedInsert(
    values={...}
).execute()

With this approach the DataFrame plan is resolved directly without the extra serialization step through a temp view and SQL text parsing.


WHY IT IS INCONSISTENT ACROSS TABLES

The non-determinism likely comes from differences in the query plan optimization path. Factors like table size, partition count, statistics availability, and plan caching can all influence the order in which the optimizer resolves columns. On some tables, the optimizer resolves _metadata correctly before the MERGE planner runs. On others, the resolution happens in a different order and fails. This is consistent with your observation that identical configurations produce different results.


ADDITIONAL NOTES

1. The delta.enableRowTracking property is not the root cause by itself, but it adds a second _metadata source into the resolution scope, which increases the chance of ambiguity on Serverless.

2. If you want to confirm this is the issue, you can test by temporarily disabling row tracking on a failing table and running the same MERGE on Serverless. If it succeeds, that confirms the dual _metadata resolution conflict.
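A sketch of that test, with 'target_table' as a placeholder for the failing table (remember to re-enable the property afterwards, since other features may depend on row tracking):

```sql
-- 'target_table' is a placeholder for the failing table.
ALTER TABLE target_table SET TBLPROPERTIES ('delta.enableRowTracking' = 'false');

-- ... rerun the MERGE on Serverless ...

-- Re-enable afterwards.
ALTER TABLE target_table SET TBLPROPERTIES ('delta.enableRowTracking' = 'true');
```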

3. This pattern of materializing virtual columns before creating temp views is a general best practice for Serverless compute, not just for _metadata. Any hidden or virtual column should be resolved into a physical column before passing through plan boundaries like temp views.

References:
https://docs.databricks.com/en/ingestion/file-metadata-column.html
https://docs.databricks.com/en/delta/row-tracking.html
https://docs.databricks.com/en/compute/serverless/limitations.html

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.