cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM
Visitor

I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.

The application consumes two Kafka topics that originate from an external market-data websocket source:

  • a trade stream

  • a candlestick (kline/OHLCV) stream

I'm using the following schemas in my Spark job:

trade_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("t", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("T", LongType(), True),
    StructField("m", BooleanType(), True)
])
parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
)

The Spark application fails during parsing with:

AnalysisException:
[AMBIGUOUS_REFERENCE_TO_FIELDS]
Ambiguous reference to the field `t`.
It appears 2 times in the schema.

The traceback points to a .select(...) operation.

I also consume a second stream containing nested structures with fields such as:

{
  "e": "kline",
  "k": {
    "t": 1782371940000,
    "T": 1782371999999
  }
}

What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.

My understanding is that Spark should distinguish between:

col("json.t")

and

col("json.k.t")

Questions:

  1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?

  2. Can nested fields in a separate schema cause this error?

  3. Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?

  4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field?

I'm mainly interested in understanding the cause so I can debug it myself.

1 REPLY 1

balajij8
Contributor III

Hi,

1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS

It occurs when Spark finds multiple columns with the same name at the same nesting level in a Data Frame. It most commonly happens due to 

  • Wildcard expansion - Using .select("json.*") followed by .select("*", "k.*") creates both a struct field k (containing nested t) and a flat field t at the top level
  • Union/Join collisions: Combining Data Frames that both have fields named t without proper aliasing
  • Duplicate schema definitions: Defining the same field twice in a StructType

2. Can nested fields in a separate schema cause this error

Not by itself. Trade schema with t and kline schema with k.t are good independently.

Problem arises when you

  • Expand struct fields with wildcards (select("k.*") promotes nested k.t to top-level t)
  • Combine both streams without distinct column names

3. Can nested fields in a separate schema cause this error

  • Wildcard expansion (most common)
  • Column expansion with select("*") or select("struct_field.*")
  • Union/join operations without explicit column selection

4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field

You can check parsing code

  • Print df.columns after each transformation to spot duplicates
  • Print df.printSchema() to see if it appears at multiple levels
  • Check for .select("json.*") or .select("*", "k.*") patterns

Use explicit nested field paths with proper aliasing for kline stream like you already do for trade stream.