I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.
The application consumes two Kafka topics that originate from an external market-data websocket source:
I'm using the following schemas in my Spark job:
trade_schema = StructType([
StructField("e", StringType(), True),
StructField("s", StringType(), True),
StructField("t", LongType(), True),
StructField("p", StringType(), True),
StructField("q", StringType(), True),
StructField("T", LongType(), True),
StructField("m", BooleanType(), True)
])parsed_trade_df = (
trade_raw_df
.select(
from_json(
col("value").cast("string"),
trade_schema
).alias("json")
)
.filter(col("json").isNotNull())
.select(
col("json.e").alias("event_type"),
col("json.s").alias("symbol"),
col("json.t").alias("trade_id"),
col("json.p").cast(DecimalType(18, 2)).alias("price"),
col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
col("json.T").alias("trade_time_ms"),
col("json.m").alias("is_buyer_maker")
)
)The Spark application fails during parsing with:
AnalysisException:
[AMBIGUOUS_REFERENCE_TO_FIELDS]
Ambiguous reference to the field `t`.
It appears 2 times in the schema.
The traceback points to a .select(...) operation.
I also consume a second stream containing nested structures with fields such as:
{
"e": "kline",
"k": {
"t": 1782371940000,
"T": 1782371999999
}
}What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.
My understanding is that Spark should distinguish between:
col("json.t")and
col("json.k.t")Questions:
What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?
Can nested fields in a separate schema cause this error?
Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?
What debugging steps would you recommend to identify which DataFrame contains the duplicate field?
I'm mainly interested in understanding the cause so I can debug it myself.