PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM
Visitor

I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.

The application consumes two Kafka topics that originate from an external market-data websocket source:

  • a trade stream

  • a candlestick (kline/OHLCV) stream

I'm using the following schemas in my Spark job:

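from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    BooleanType, DecimalType, LongType, StringType, StructField, StructType
)

trade_schema = StructType([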
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("t", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("T", LongType(), True),
    StructField("m", BooleanType(), True)
])

parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
)

The Spark application fails during parsing with:

AnalysisException:
[AMBIGUOUS_REFERENCE_TO_FIELDS]
Ambiguous reference to the field `t`.
It appears 2 times in the schema.

The traceback points to a .select(...) operation.

I also consume a second stream containing nested structures with fields such as:

{
  "e": "kline",
  "k": {
    "t": 1782371940000,
    "T": 1782371999999
  }
}

What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.

My understanding is that Spark should distinguish between:

col("json.t")

and

col("json.k.t")

Questions:

  1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?

  2. Can nested fields in a separate schema cause this error?

  3. Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?

  4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field?

I'm mainly interested in understanding the cause so I can debug it myself.

balajij8
Contributor III

Hi,

1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?

It occurs when Spark finds multiple fields with the same name at the same nesting level in a DataFrame's schema. It most commonly happens due to:

  • Wildcard expansion: using .select("json.*") followed by .select("*", "k.*") creates both a struct field k (containing nested t) and a flat field t at the top level
  • Union/join collisions: combining DataFrames that both have fields named t without proper aliasing
  • Duplicate schema definitions: defining the same field twice in a StructType
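
One more trigger worth ruling out, offered as an assumption about this setup rather than a confirmed diagnosis: Spark matches names case-insensitively by default (spark.sql.caseSensitive is false), so two sibling fields whose names differ only by case, such as t and T in the posted trade_schema, can both match a reference like json.t. The pure-Python sketch below only imitates that matching rule to show where a count of 2 could come from; matching_fields is a hypothetical helper, not a Spark API.

```python
def matching_fields(field_names, target, case_sensitive=False):
    """Return the sibling field names that a reference to `target` would match.

    This imitates name matching only; it is not Spark's actual resolver.
    """
    if case_sensitive:
        return [name for name in field_names if name == target]
    return [name for name in field_names if name.lower() == target.lower()]


# Field names copied from the trade_schema in the question.
trade_fields = ["e", "s", "t", "p", "q", "T", "m"]

# Case-insensitive matching (Spark's default setting): two candidates for "t".
insensitive = matching_fields(trade_fields, "t")

# Case-sensitive matching: only the exact name matches.
sensitive = matching_fields(trade_fields, "t", case_sensitive=True)

print(insensitive)
print(sensitive)
```

If this turns out to be the cause, one option to test is setting spark.sql.caseSensitive to true for the session before parsing, so that t and T are treated as distinct fields.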

2. Can nested fields in a separate schema cause this error?

Not by themselves. A trade schema with t and a kline schema with k.t are each fine on their own.

The problem arises when you:

  • Expand struct fields with wildcards (select("k.*") promotes nested k.t to top-level t)
  • Combine both streams without distinct column names

3. Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?

  • Wildcard/column expansion with select("*") or select("struct_field.*") (most common)
  • Union/join operations without explicit column selection

4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field?

You can check the parsing code:

  • Print df.columns after each transformation to spot duplicates
  • Call df.printSchema() to see whether the field appears at multiple levels
  • Check for .select("json.*") or .select("*", "k.*") patterns
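
The manual checks above can also be automated. The sketch below walks a schema in the dictionary form returned by df.schema.jsonValue() and reports sibling fields whose names collide at the same nesting level. find_duplicate_fields and sample_schema are hypothetical names for illustration, and the case-insensitive comparison assumes the default spark.sql.caseSensitive=false setting.

```python
from collections import defaultdict


def find_duplicate_fields(struct, path="", case_sensitive=False):
    """Walk a schema dict shaped like `df.schema.jsonValue()` and return
    {struct path: [groups of sibling field names that collide]}."""
    collisions = {}
    groups = defaultdict(list)
    for field in struct.get("fields", []):
        name = field["name"]
        key = name if case_sensitive else name.lower()
        groups[key].append(name)
        child = field["type"]
        # Unwrap array types until a struct or a primitive type is reached.
        while isinstance(child, dict) and child.get("type") == "array":
            child = child["elementType"]
        if isinstance(child, dict) and child.get("type") == "struct":
            child_path = f"{path}.{name}" if path else name
            collisions.update(
                find_duplicate_fields(child, child_path, case_sensitive)
            )
    clashes = [names for names in groups.values() if len(names) > 1]
    if clashes:
        collisions[path or "<root>"] = clashes
    return collisions


# Hand-written stand-in for a kline schema; in a real job this would be
# df.schema.jsonValue() for the DataFrame under inspection.
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "e", "type": "string", "nullable": True, "metadata": {}},
        {"name": "k", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "t", "type": "long", "nullable": True, "metadata": {}},
                {"name": "T", "type": "long", "nullable": True, "metadata": {}},
            ],
        }},
    ],
}

duplicates = find_duplicate_fields(sample_schema)
print(duplicates)
```

Running find_duplicate_fields(df.schema.jsonValue()) after each transformation should narrow down which DataFrame, and which struct inside it, holds the colliding names.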

Use explicit nested field paths with proper aliasing for the kline stream, as you already do for the trade stream.