PySpark AnalysisException: Ambiguous reference to ...

VikasM · 9 hours ago

I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.

The application consumes two Kafka topics that originate from an external market-data websocket source:

a trade stream
a candlestick (kline/OHLCV) stream

I'm using the following schemas in my Spark job:

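from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    BooleanType, DecimalType, LongType, StringType, StructField, StructType
)

trade_schema = StructType([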
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("t", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("T", LongType(), True),
    StructField("m", BooleanType(), True)
])

parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
)

The Spark application fails during parsing with:

AnalysisException:
[AMBIGUOUS_REFERENCE_TO_FIELDS]
Ambiguous reference to the field `t`.
It appears 2 times in the schema.

The traceback points to a .select(...) operation.

I also consume a second stream containing nested structures with fields such as:

{
  "e": "kline",
  "k": {
    "t": 1782371940000,
    "T": 1782371999999
  }
}

What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.

My understanding is that Spark should distinguish between:

col("json.t")

and

col("json.k.t")

Questions:

What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?
Can nested fields in a separate schema cause this error?
Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?
What debugging steps would you recommend to identify which DataFrame contains the duplicate field?

I'm mainly interested in understanding the cause so I can debug it myself.

balajij8 · 7 hours ago

Hi,

1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS

It is raised when a reference to a struct field matches more than one field at the same nesting level of that struct. The most common ways to end up with such duplicates are:

Wildcard expansion: using .select("json.*") followed by .select("*", "k.*") creates both a struct field k (containing nested t) and a flat field t at the top level
Union/join collisions: combining DataFrames that both have fields named t without proper aliasing
Duplicate schema definitions: defining the same field twice in a StructType
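
One more trigger that seems worth ruling out for the schema in the question: Spark resolves names case-insensitively by default (spark.sql.caseSensitive is false), so two fields in the same struct whose names differ only by case can both match a single reference such as json.t. The sketch below is plain Python and only imitates that comparison; colliding_names is a hypothetical helper, not part of PySpark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def colliding_names(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that are equal when compared case-insensitively."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the groups where more than one original name collapsed together.
    return {key: found for key, found in groups.items() if len(found) > 1}


# Field names copied from trade_schema in the question.
trade_fields = ["e", "s", "t", "p", "q", "T", "m"]
print(colliding_names(trade_fields))  # {'t': ['t', 'T']}
```

The same helper can be pointed at df.columns to check top-level column names after a transformation.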

2. Can nested fields in a separate schema cause this error

Not by itself. A trade schema with a top-level t and a kline schema with a nested k.t do not conflict with each other, because each DataFrame is resolved against its own schema.

The problem arises when you:

Expand struct fields with wildcards (select("k.*") promotes nested k.t to top-level t)
Combine both streams without distinct column names

3. Is this usually related to schema definitions, column expansion, joins, or something else

In rough order of likelihood:

Wildcard expansion with select("*") or select("struct_field.*") (most common)
Union/join operations without explicit column selection

4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field

Start with the parsing code:

Print df.columns after each transformation to spot duplicates
Call df.printSchema() to see whether the field appears at multiple levels
Check for .select("json.*") or .select("*", "k.*") patterns
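
The schema check can also be automated. The following is a minimal sketch in plain Python, assuming the input dict has the shape returned by df.schema.jsonValue() in PySpark; find_ambiguous_fields is a hypothetical helper, and it compares names case-insensitively on the assumption that the session uses Spark's default resolution.

```python
from typing import Any, Dict, List, Tuple


def find_ambiguous_fields(
    schema: Any, path: str = ""
) -> List[Tuple[str, List[str]]]:
    """Return (struct_path, colliding_names) pairs for a schema dict."""
    findings: List[Tuple[str, List[str]]] = []
    if not isinstance(schema, dict):
        return findings  # primitive types are plain strings such as "long"

    kind = schema.get("type")
    if kind == "struct":
        # Group sibling field names by their lowercased form.
        groups: Dict[str, List[str]] = {}
        for field in schema["fields"]:
            groups.setdefault(field["name"].lower(), []).append(field["name"])
        for names in groups.values():
            if len(names) > 1:
                findings.append((path or "<root>", names))
        # Recurse into each field's type to cover nested structs.
        for field in schema["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            findings.extend(find_ambiguous_fields(field["type"], child))
    elif kind == "array":
        findings.extend(find_ambiguous_fields(schema["elementType"], path))
    elif kind == "map":
        findings.extend(find_ambiguous_fields(schema["valueType"], path))
    return findings


# Hand-written stand-in for parsed_df.schema.jsonValue(), trimmed to three fields.
example_schema = {
    "type": "struct",
    "fields": [
        {
            "name": "json",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "e", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "t", "type": "long", "nullable": True, "metadata": {}},
                    {"name": "T", "type": "long", "nullable": True, "metadata": {}},
                ],
            },
        }
    ],
}

print(find_ambiguous_fields(example_schema))  # [('json', ['t', 'T'])]
```

In a real job the call would look like find_ambiguous_fields(df.schema.jsonValue()), run on the DataFrame just before the failing select.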

Use explicit nested field paths with proper aliasing for the kline stream, as you already do for the trade stream.
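
If the duplicate turns out to be a case-only collision inside one struct (for example t and T), explicit paths alone may not be enough, since json.t can still match both fields under case-insensitive resolution. One option is to make resolution case-sensitive. A minimal sketch as a configuration fragment, assuming the job reads spark-defaults.conf:

```
# spark-defaults.conf (assumption: the job picks up this file)
# Treat names that differ only by case, such as t and T, as different fields.
spark.sql.caseSensitive    true
```

The same key can be passed to spark-submit as --conf spark.sql.caseSensitive=true. It applies to the whole session, so every other column reference in the job then has to match case exactly.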