PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON
9 hours ago
I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.
The application consumes two Kafka topics that originate from an external market-data websocket source:
a trade stream
a candlestick (kline/OHLCV) stream
I'm using the following schemas in my Spark job:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    BooleanType, DecimalType, LongType, StringType, StructField, StructType,
)

trade_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("t", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("T", LongType(), True),
    StructField("m", BooleanType(), True)
])

parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(col("value").cast("string"), trade_schema).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
)

The Spark application fails during parsing with:
AnalysisException: [AMBIGUOUS_REFERENCE_TO_FIELDS] Ambiguous reference to the field `t`. It appears 2 times in the schema.
The traceback points to a .select(...) operation.
I also consume a second stream containing nested structures with fields such as:

{
  "e": "kline",
  "k": {
    "t": 1782371940000,
    "T": 1782371999999
  }
}

What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.
My understanding is that Spark should distinguish between:
col("json.t")
and
col("json.k.t")

Questions:
What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?
Can nested fields in a separate schema cause this error?
Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?
What debugging steps would you recommend to identify which DataFrame contains the duplicate field?
I'm mainly interested in understanding the cause so I can debug it myself.
7 hours ago
Hi,
1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?
It occurs when Spark finds more than one field with the same name at the same nesting level of a DataFrame. The most common causes are:
- Wildcard expansion: using .select("json.*") followed by .select("*", "k.*") leaves you with both a struct field k (containing nested t) and a flat field t at the top level
- Union/join collisions: combining DataFrames that both have fields named t without proper aliasing
- Duplicate schema definitions: defining the same field twice in a StructType
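
One more thing worth checking under "duplicate schema definitions": Spark compares column and field names case-insensitively unless spark.sql.caseSensitive is set to true, so a struct that defines both t and T can count as defining the same field twice. The trade schema in the question has exactly that pair, which makes it a likely trigger for col("json.t"). Below is a minimal pure-Python sketch for spotting such clashes. It assumes the schema is given in the dict shape returned by StructType.jsonValue(); the helper name find_ambiguous_fields and the hand-written dicts (which omit the nullable and metadata keys) are illustrative only.

```python
from collections import defaultdict


def find_ambiguous_fields(schema, path="<root>", case_sensitive=False):
    """Return {struct_path: [[clashing names], ...]} for a schema dict.

    Assumes the dict layout of StructType.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}.
    """
    found = {}
    if not isinstance(schema, dict):
        return found  # primitive type such as "long" or "string"
    if schema.get("type") == "struct":
        groups = defaultdict(list)
        for field in schema["fields"]:
            name = field["name"]
            groups[name if case_sensitive else name.lower()].append(name)
        clashes = [names for names in groups.values() if len(names) > 1]
        if clashes:
            found[path] = clashes
        # Recurse into nested structs and arrays of structs.
        for field in schema["fields"]:
            child = field["name"] if path == "<root>" else f"{path}.{field['name']}"
            found.update(find_ambiguous_fields(field["type"], child, case_sensitive))
    elif schema.get("type") == "array":
        found.update(find_ambiguous_fields(schema["elementType"], path, case_sensitive))
    return found


# Hand-written equivalents of the two schemas discussed in the question.
trade_schema_json = {"type": "struct", "fields": [
    {"name": "e", "type": "string"}, {"name": "s", "type": "string"},
    {"name": "t", "type": "long"}, {"name": "p", "type": "string"},
    {"name": "q", "type": "string"}, {"name": "T", "type": "long"},
    {"name": "m", "type": "boolean"},
]}
kline_schema_json = {"type": "struct", "fields": [
    {"name": "e", "type": "string"},
    {"name": "k", "type": {"type": "struct", "fields": [
        {"name": "t", "type": "long"}, {"name": "T", "type": "long"},
    ]}},
]}

trade_clashes = find_ambiguous_fields(trade_schema_json)
kline_clashes = find_ambiguous_fields(kline_schema_json)
strict_clashes = find_ambiguous_fields(trade_schema_json, case_sensitive=True)
```

In a real job you would pass trade_schema.jsonValue() instead of the hand-written dict. If the only clash reported is a pair such as t and T, setting spark.sql.caseSensitive to true before the select is one option to try.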
2. Can nested fields in a separate schema cause this error?
Not by themselves. A trade schema with t and a kline schema with k.t are each fine on their own.
The problem arises when you:
- Expand struct fields with wildcards (select("k.*") promotes nested k.t to top-level t)
- Combine both streams without distinct column names
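
To make the wildcard case concrete, here is a small pure-Python model of which column names come out of select("*", "k.*"). It is not Spark code: expand_star and duplicate_names are hypothetical helpers, and the field names come from the kline sample in the question. Spark itself lets a DataFrame carry duplicate column names, so the failure only shows up later, when one of them is referenced by name.

```python
from collections import Counter


def expand_star(top_level, structs, expand):
    """Model the column names produced by df.select("*", "<struct>.*")."""
    columns = list(top_level)           # "*" keeps every existing column
    for struct_name in expand:
        columns.extend(structs[struct_name])  # "<struct>.*" promotes its fields
    return columns


def duplicate_names(columns, case_sensitive=False):
    """Return the names that occur more than once, lowercased unless case_sensitive."""
    counts = Counter(c if case_sensitive else c.lower() for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)


# Kline sample from the question: top-level e and k, with k holding t and T.
expanded = expand_star(["e", "k"], {"k": ["t", "T"]}, expand=["k"])
dupes_default = duplicate_names(expanded)
dupes_strict = duplicate_names(expanded, case_sensitive=True)
```

The same duplicate_names check can be pointed at a real df.columns list after each transformation.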
3. Is this usually related to schema definitions, column expansion, joins, or something else?
Typically one of the following:
- Column expansion with select("*") or select("struct_field.*") (most common)
- Union/join operations without explicit column selection
4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field?
Walk through the parsing code one transformation at a time:
- Print df.columns after each transformation to spot duplicates
- Call df.printSchema() to see whether the field appears at multiple levels
- Check for .select("json.*") or .select("*", "k.*") patterns
For the kline stream, use explicit nested field paths with proper aliasing, as you already do for the trade stream.
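
A rough sketch of what "explicit paths with aliasing" could look like for the kline stream, written as plain Python so the alias plan can be checked before it reaches Spark. The helper build_select_plan and the alias names are hypothetical; in particular, reading k.t and k.T as start and end times is an assumption, since the thread does not say what they mean.

```python
def build_select_plan(alias_by_path, root="json"):
    """Return (full_path, alias) pairs; reject aliases that collide ignoring case."""
    seen = {}
    plan = []
    for path, alias in alias_by_path.items():
        key = alias.lower()
        if key in seen:
            raise ValueError(
                f"alias {alias!r} for {path!r} collides with the alias for {seen[key]!r}"
            )
        seen[key] = path
        plan.append((f"{root}.{path}", alias))
    return plan


kline_aliases = {
    "e": "event_type",
    "k.t": "kline_start_ms",  # hypothetical alias, meaning assumed
    "k.T": "kline_end_ms",    # hypothetical alias, meaning assumed
}
plan = build_select_plan(kline_aliases)
```

In the Spark job the plan would feed something like df.select([col(p).alias(a) for p, a in plan]). Note that with the default case-insensitive resolution, the path json.k.t can still be ambiguous while k holds both t and T, so this only helps once spark.sql.caseSensitive is true or the clash is removed from the schema.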