PySpark AnalysisException: Ambiguous reference to field t when parsing nested JSON

VikasM — Thu, 25 Jun 2026 08:18:08 GMT

I'm working on a personal data engineering project using Kafka, Spark Structured Streaming, and Docker.

The application consumes two Kafka topics that originate from an external market-data websocket source:

a trade stream
a candlestick (kline/OHLCV) stream

I'm using the following schemas in my Spark job:

trade_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("t", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("T", LongType(), True),
    StructField("m", BooleanType(), True)
])

parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
)

The Spark application fails during parsing with:

AnalysisException:
[AMBIGUOUS_REFERENCE_TO_FIELDS]
Ambiguous reference to the field `t`.
It appears 2 times in the schema.

The traceback points to a .select(...) operation.

I also consume a second stream containing nested structures with fields such as:

{
  "e": "kline",
  "k": {
    "t": 1782371940000,
    "T": 1782371999999
  }
}

What I'm trying to understand is the root cause of Spark reporting an ambiguous reference to t.

My understanding is that Spark should distinguish between:

col("json.t")

and

col("json.k.t")

Questions:

What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS?
Can nested fields in a separate schema cause this error?
Is this usually related to schema definitions, column expansion (select("*"), select("json.*")), joins, or something else?
What debugging steps would you recommend to identify which DataFrame contains the duplicate field?

I'm mainly interested in understanding the cause so I can debug it myself.

balajij8 — Thu, 25 Jun 2026 09:49:50 GMT

Hi,

1. What situations typically trigger AMBIGUOUS_REFERENCE_TO_FIELDS

It occurs when Spark finds multiple fields with the same name at the same nesting level of a DataFrame. The most common causes are:

Wildcard expansion: using .select("json.*") followed by .select("*", "k.*") creates both a struct field k (containing nested t) and a flat field t at the top level.
Union/join collisions: combining DataFrames that both have fields named t without proper aliasing.
Duplicate schema definitions: defining the same field twice in a StructType.

2. Can nested fields in a separate schema cause this error

Not by itself. A trade schema with t and a kline schema with k.t are each fine on their own.

The problem arises when you:

Expand struct fields with wildcards (select("k.*") promotes nested k.t to top-level t)
Combine both streams without distinct column names

3. Is this usually related to schema definitions, column expansion, joins, or something else

Wildcard expansion (most common)
Column expansion with select("*") or select("struct_field.*")
Union/join operations without explicit column selection

4. What debugging steps would you recommend to identify which DataFrame contains the duplicate field

Check the parsing code:

Print df.columns after each transformation to spot duplicates
Call df.printSchema() to see whether t appears at multiple levels
Check for .select("json.*") or .select("*", "k.*") patterns

Use explicit nested field paths with proper aliasing for kline stream like you already do for trade stream.

VikasM — Fri, 26 Jun 2026 09:15:26 GMT

Thank you for the detailed explanation. It helped me understand the common situations that can trigger AMBIGUOUS_REFERENCE_TO_FIELDS.

Based on your suggestions, I reviewed my code again, but I'm still confused because I don't believe any of those cases apply here.

Specifically:

I'm not using .select("json.*") or .select("*").
I'm not expanding k.*.
I'm not performing any joins or unions before this error occurs.
The exception is raised while parsing the trade stream, before any processing of the kline stream begins.

The parsing code is simply:

trade_raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("subscribe", TRADE_STREAM_NAME)
    .option("startingOffsets", "latest")
    .load()
)

# ========================================================================
# TRADE PARSING
# ========================================================================
parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
    .filter(col("event_type") == "trade")
    .filter(col("trade_time_ms").isNotNull())
    .filter(col("price").isNotNull())
    .filter(col("quantity").isNotNull())
)

parsed_trade_df.printSchema()

# ========================================================================
# TRADE TRANSFORMATION
# ========================================================================
trade_df = (
    parsed_trade_df
    .withColumn(
        "event_time",
        (col("trade_time_ms") / 1000).cast("timestamp")
    )
    .withColumn(
        "total_value_usd",
        col("price") * col("quantity")
    )
)

The traceback points to this .select(...) call, which is why I'm struggling to understand where Spark is finding two fields named t.

Would you mind taking a look at this parsing code? If it doesn't immediately stand out, I'm also happy to share the rest of the streaming script if that would help identify where the duplicate field might be coming from.

Thank you again for your guidance.

balajij8 — Fri, 26 Jun 2026 09:29:18 GMT

Hi Vikas,

You may share the code directly in a message.

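You can try the following:

Enable case-sensitive mode with spark.conf.set("spark.sql.caseSensitive", "true") and rerun the code. Spark resolves field names case-insensitively by default, so json.t matches both t and T in trade_schema.
Rename the fields immediately after parsing and rerun the code.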

parsed_trade_df = (
    trade_raw_df
    .select(from_json(col("value").cast("string"), trade_schema).alias("json"))
)

changed_df = parsed_trade_df.select(
    col("json.t").alias("trade_id"),
    col("json.T").alias("trade_time_ms"),
    col("json.e").alias("event_type")
)
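
The case-sensitivity check can also be scripted before the job runs. Spark compares field names case-insensitively unless spark.sql.caseSensitive is true, so sibling fields such as t and T collide. The sketch below is plain Python and walks the dictionary form of a schema (the shape that StructType.jsonValue() returns in PySpark), reporting siblings whose names differ only by case. The helper name find_case_collisions is made up for illustration, and the hand-written trade_schema_json is a simplified stand-in for trade_schema.jsonValue().

```python
from collections import defaultdict


def find_case_collisions(struct, prefix=""):
    """Return (parent_path, names) pairs for sibling fields that differ only by case."""
    collisions = []
    by_lower = defaultdict(list)
    for field in struct.get("fields", []):
        by_lower[field["name"].lower()].append(field["name"])
        # Nested structs are stored as dicts under "type"; recurse into them.
        if isinstance(field["type"], dict) and field["type"].get("type") == "struct":
            collisions += find_case_collisions(field["type"], prefix + field["name"] + ".")
    for names in by_lower.values():
        if len(names) > 1:
            collisions.append((prefix.rstrip(".") or "<root>", names))
    return collisions


# Simplified stand-in for trade_schema.jsonValue() (assumption: only name/type kept).
trade_schema_json = {
    "type": "struct",
    "fields": [
        {"name": "e", "type": "string"},
        {"name": "s", "type": "string"},
        {"name": "t", "type": "long"},
        {"name": "p", "type": "string"},
        {"name": "q", "type": "string"},
        {"name": "T", "type": "long"},
        {"name": "m", "type": "boolean"},
    ],
}

print(find_case_collisions(trade_schema_json))  # [('<root>', ['t', 'T'])]
```

An empty list means no siblings collide under case-insensitive resolution; any reported pair needs either case-sensitive mode or distinct names.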

balajij8 — Fri, 26 Jun 2026 09:54:38 GMT

Hi Vikas, another way to avoid the ambiguity is to give the fields distinct names in the schema up front.

# Rename fields in the schema so no two names differ only by case
trade_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("trade_id", LongType(), True),       # Renamed from "t"
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("trade_time_ms", LongType(), True),  # Renamed from "T"
    StructField("m", BooleanType(), True)
])

# Caution: from_json matches JSON keys to schema field names, so these fields
# are only populated if the messages carry the keys "trade_id" and
# "trade_time_ms"; the keys "t" and "T" would come back as null.
parsed_trade_df = (
    trade_raw_df
    .select(from_json(col("value").cast("string"), trade_schema).alias("json"))
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.trade_id").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.trade_time_ms").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
    .filter(col("event_type") == "trade")
    .filter(col("trade_time_ms").isNotNull())
    .filter(col("price").isNotNull())
    .filter(col("quantity").isNotNull())
)

VikasM — Fri, 26 Jun 2026 10:56:52 GMT

I haven't tried enabling case-sensitive mode yet, but I'll test it and report back with the results.

In the meantime, I'm also attaching my 'spark_streaming.py' script. If you have a chance to review it, I'd really appreciate it. I'm wondering if there's something in my parsing or transformation logic that's causing Spark to report the ambiguous field reference.

Thank you again for your guidance.

spark_streaming script

import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    DecimalType,
    LongType,
    BooleanType
)

# ========================================================================
# CONFIG
# ========================================================================
RUN_MODE = os.getenv("RUN_MODE", "LOCAL")
KAFKA_BROKER = os.getenv("KAFKA_BROKER", "kafka:29092")
TRADE_STREAM_NAME = "crypto-trade-stream"
KLINE_STREAM_NAME = "crypto-kline-stream"
WHALE_THRESHOLD_USD = "50000"

print(f"⚙️ Starting Spark Streaming [{RUN_MODE}]")

# ========================================================================
# TRADE STREAM SCHEMA (MICRO VIEW)
# ========================================================================
trade_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("trade_id", LongType(), True),
    StructField("p", StringType(), True),
    StructField("q", StringType(), True),
    StructField("trade_time_ms", LongType(), True),
    StructField("m", BooleanType(), True)
])

# ========================================================================
# KLINE STREAM SCHEMA (MACRO VIEW)
# ========================================================================
kline_schema = StructType([
    StructField("e", StringType(), True),
    StructField("s", StringType(), True),
    StructField("E", LongType(), True),
    StructField(
        "k",
        StructType([
            StructField("t", LongType(), True),
            StructField("T", LongType(), True),
            StructField("i", StringType(), True),
            StructField("o", StringType(), True),
            StructField("h", StringType(), True),
            StructField("l", StringType(), True),
            StructField("c", StringType(), True),
            StructField("v", StringType(), True),
            StructField("q", StringType(), True),
            StructField("n", LongType(), True),
            StructField("x", BooleanType(), True)
        ])
    )
])

# ========================================================================
# SPARK SESSION
# ========================================================================
builder = (
    SparkSession.builder
    .appName("CryptoWhaleStreamingEngine")
)

if RUN_MODE == "LOCAL":
    builder = builder.master("spark://spark-master:7077")

spark = builder.getOrCreate()
spark.sparkContext.setLogLevel("WARN")

# ========================================================================
# TRADE STREAM SOURCE
# ========================================================================
trade_raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("subscribe", TRADE_STREAM_NAME)
    .option("startingOffsets", "latest")
    .load()
)

# ========================================================================
# TRADE PARSING
# ========================================================================
parsed_trade_df = (
    trade_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            trade_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.t").alias("trade_id"),
        col("json.p").cast(DecimalType(18, 2)).alias("price"),
        col("json.q").cast(DecimalType(18, 6)).alias("quantity"),
        col("json.T").alias("trade_time_ms"),
        col("json.m").alias("is_buyer_maker")
    )
    .filter(col("event_type") == "trade")
    .filter(col("trade_time_ms").isNotNull())
    .filter(col("price").isNotNull())
    .filter(col("quantity").isNotNull())
)

parsed_trade_df.printSchema()

# ========================================================================
# TRADE TRANSFORMATION
# ========================================================================
trade_df = (
    parsed_trade_df
    .withColumn(
        "event_time",
        (col("trade_time_ms") / 1000).cast("timestamp")
    )
    .withColumn(
        "total_value_usd",
        col("price") * col("quantity")
    )
)

# ========================================================================
# WATERMARK
# ========================================================================
watermarked_trade_df = (
    trade_df.withWatermark(
        "event_time",
        "10 minutes"
    )
)

# ========================================================================
# WHALE DETECTION
# ========================================================================
whale_df = (
    watermarked_trade_df
    .filter(col("total_value_usd") >= WHALE_THRESHOLD_USD)
)

print("✅ Starting KLINE source")

# ========================================================================
# KLINE STREAM SOURCE
# ========================================================================
kline_raw_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("subscribe", KLINE_STREAM_NAME)
    .option("startingOffsets", "latest")
    .load()
)

# ========================================================================
# KLINE PARSING
# ========================================================================
parsed_kline_df = (
    kline_raw_df
    .select(
        from_json(
            col("value").cast("string"),
            kline_schema
        ).alias("json")
    )
    .filter(col("json").isNotNull())
    .select(
        col("json.e").alias("event_type"),
        col("json.s").alias("symbol"),
        col("json.E").alias("event_time_ms"),
        col("json.k.t").alias("candle_start_ms"),
        col("json.k.T").alias("candle_close_ms"),
        col("json.k.i").alias("interval"),
        col("json.k.o").cast(DecimalType(18, 2)).alias("open_price"),
        col("json.k.h").cast(DecimalType(18, 2)).alias("high_price"),
        col("json.k.l").cast(DecimalType(18, 2)).alias("low_price"),
        col("json.k.c").cast(DecimalType(18, 2)).alias("close_price"),
        col("json.k.v").cast(DecimalType(18, 6)).alias("base_volume"),
        col("json.k.q").cast(DecimalType(18, 2)).alias("quote_volume"),
        col("json.k.n").alias("trades_count"),
        col("json.k.x").alias("is_candle_closed")
    )
    .filter(col("event_type") == "kline")
    .filter(col("event_time_ms").isNotNull())
)

parsed_kline_df.printSchema()

# ========================================================================
# KLINE TRANSFORMATION
# ========================================================================
kline_df = (
    parsed_kline_df
    .withColumn(
        "event_time",
        (col("event_time_ms") / 1000).cast("timestamp")
    )
)

# ========================================================================
# DEBUG FUNCTIONS
# ========================================================================
def debug_whales(df, batch_id):
    count = df.count()
    if count > 0:
        print(
            f"🐋 Batch {batch_id}: "
            f"{count} whale trades detected"
        )


def debug_klines(df, batch_id):
    count = df.count()
    if count > 0:
        print(
            f"📈 Batch {batch_id}: "
            f"{count} kline records processed"
        )

# ========================================================================
# DEBUG STREAMS
# ========================================================================
trade_debug_query = (
    whale_df.writeStream
    .queryName("whale_debug")
    .foreachBatch(debug_whales)
    .start()
)

kline_debug_query = (
    kline_df.writeStream
    .queryName("kline_debug")
    .foreachBatch(debug_klines)
    .start()
)

# ========================================================================
# WHALE PARQUET SINK
# ========================================================================
whale_query = (
    whale_df.writeStream
    .queryName("whale_alerts")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/whale_alerts"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/whale_alerts"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

# ========================================================================
# KLINE PARQUET SINK
# ========================================================================
kline_query = (
    kline_df.writeStream
    .queryName("candlestick_history")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/candlesticks"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/candlesticks"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

print("🚀 Whale detection pipeline running")
print("🚀 Candlestick pipeline running")

spark.streams.awaitAnyTermination()

balajij8 — Fri, 26 Jun 2026 11:03:07 GMT

Do check the other 2 options listed above too - upfront schema setup & field renaming

VikasM — Fri, 26 Jun 2026 11:54:44 GMT

Thank you, balajij8, for your suggestion about enabling case-sensitive mode. It worked! The process now moves past the previous error, and Spark is successfully consuming data from Kafka.

However, it looks like I've run into another issue. Although the streaming job is consuming the data, it doesn't appear to be writing any Parquet files as expected.

I do see the checkpoint directories being created correctly, both inside the Spark container and on my local machine through the mounted volume, so it seems the streaming queries are running. The only thing missing is the Parquet output.

I'll investigate this next, but if you have any suggestions about what might cause Spark Structured Streaming to create checkpoints without writing any output files, I'd really appreciate your guidance.

The following is my Parquet sink:

whale_query = (
    whale_df.writeStream
    .queryName("whale_alerts")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/whale_alerts"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/whale_alerts"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

# ========================================================================
# KLINE PARQUET SINK
# ========================================================================
kline_query = (
    kline_df.writeStream
    .queryName("candlestick_history")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/candlesticks"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/candlesticks"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

print("🚀 Whale detection pipeline running")
print("🚀 Candlestick pipeline running")

spark.streams.awaitAnyTermination()

Thank you again for your help!

VikasM — Fri, 26 Jun 2026 11:59:37 GMT

I have sent a reply to this message five times already, and I don't know what is going on here.

balajij8 — Fri, 26 Jun 2026 12:12:10 GMT

The sink configuration looks correct, so the issue is most likely upstream. The Parquet sink can only write files when it receives rows from upstream. Check these two settings:

startingOffsets = latest: the job skips all historical Kafka data and only processes messages that arrive after the stream starts. Set it to earliest and rerun. Note that startingOffsets only applies when a query starts without an existing checkpoint.
WHALE_THRESHOLD_USD = 50000: a typical trade value can be as low as 5 to 10. Lower the threshold temporarily to validate, then set it back to 50000.

Even if Kafka has messages, the pipeline can filter all of them out with these settings.
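
One way to tell "no input" apart from "input arrived but was filtered out" is to look at the progress report of the running query. The sketch below is plain Python and interprets a progress dictionary of the kind PySpark exposes through query.lastProgress; it assumes the report carries the batchId and numInputRows keys, and the helper name diagnose_progress and the sample dictionary are made up for illustration.

```python
def diagnose_progress(progress):
    """Turn one Structured Streaming progress report into a short hint."""
    if not progress:
        return "no micro-batch has completed yet"
    batch_id = progress.get("batchId")
    rows_in = progress.get("numInputRows", 0)
    if rows_in == 0:
        # Nothing was read from Kafka, so the filters never saw any rows.
        return f"batch {batch_id}: 0 input rows, check the topic and startingOffsets"
    # Rows were read; an empty output then points at the filters or the sink.
    return f"batch {batch_id}: {rows_in} input rows, check the filters and the sink"


# Hypothetical sample; in the job this would be whale_query.lastProgress.
sample = {"batchId": 7, "numInputRows": 0}
print(diagnose_progress(sample))
```

In the streaming script this could be called periodically with whale_query.lastProgress instead of the sample dictionary.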

VikasM — Fri, 26 Jun 2026 12:33:26 GMT

Thank you, balajij8, for your suggestions. I really appreciate your time and guidance.

I'll try the different configurations you recommended and investigate further. Once I've tested them, I'll come back and share the results.

Thanks again for your help!

P.S. "Did you see the messages I have already sent... I still don't see them above?"

VikasM — Fri, 26 Jun 2026 13:18:25 GMT

Hello balajij8,

Before trying your suggestions, I decided to inspect the filesystem inside my Spark container once more.

I found something that has changed my understanding of the problem. There are no errors being reported by the streaming job, and the checkpoint and _spark_metadata directories are being updated continuously. I also found metadata entries that indicate Spark believes it has successfully written Parquet files.

However, I cannot find the actual part-*.snappy.parquet files in the output directory, even though the metadata references them. For example:

$ cd _spark_metadata
$ ls
0 1 2 3
$ cat 1
v1
{"path":"file:///opt/spark/app/data/whale_alerts/part-00000-ac552411-0fa6-47c8-b120-4dfcc9227b09-c000.snappy.parquet","size":1125,"isDir":false,"modificationTime":1782477948968,"blockReplication":1,"blockSize":33554432,"action":"add"}

But when I run:

find /opt/spark/app/data -name "*.parquet"

no Parquet files are found, either inside the container or on my host machine. Only the _spark_metadata files exist.

Since the streaming job is processing records successfully and the metadata is being written, I'm now wondering whether this is related to the file sink, filesystem, or Docker volume configuration rather than the upstream pipeline.

Before I start changing the Kafka configuration or thresholds, do you have any thoughts on why Spark would generate metadata entries without the corresponding Parquet files?

balajij8 — Fri, 26 Jun 2026 13:37:55 GMT

For file sinks, Spark Structured Streaming generally has the tasks write the data files into the output directory and then records those files in the _spark_metadata log. That log entry is what commits the batch, so metadata that references a file means Spark believes the file was written.

Check the Docker volume mounts for misconfiguration. Some Docker setups use temporary filesystems that get cleaned up, or a background process may remove the files, in which case the files are written and then deleted. Also keep in mind that tasks run on the executors, so with a file:/// path the data files may land in a different container than the one holding _spark_metadata.

Also verify that /opt/spark/app/data is actually mounted to the host, and that _spark_metadata and the other directories keep the same read/write permissions so Spark can perform all operations.

As a test, change the code to write to a path where Spark definitely has read/write access and confirm whether the files appear.
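
The manual check with cat and find can be scripted so it is easy to repeat in each container. The sketch below is plain Python: it reads every log file under _spark_metadata, skips the version line, and reports the files marked with action "add" that do not exist on the local filesystem. It assumes the log format shown in the post above (a "v1" line followed by one JSON object per line); the helper name missing_sink_files is made up for illustration.

```python
import json
import os
from urllib.parse import unquote, urlparse


def missing_sink_files(output_dir):
    """List files that _spark_metadata marks as added but that are absent on disk."""
    log_dir = os.path.join(output_dir, "_spark_metadata")
    if not os.path.isdir(log_dir):
        return []
    missing = []
    for name in sorted(os.listdir(log_dir)):
        log_path = os.path.join(log_dir, name)
        if name.startswith(".") or not os.path.isfile(log_path):
            continue  # skip hidden files and subdirectories
        with open(log_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line.startswith("{"):
                    continue  # the first line is a version marker such as "v1"
                entry = json.loads(line)
                # Entries store a URI such as file:///opt/...; keep only the path part.
                local_path = unquote(urlparse(entry["path"]).path)
                if entry.get("action") == "add" and not os.path.exists(local_path):
                    missing.append(local_path)
    return missing


# Example call inside a container (path taken from the sink configuration):
# print(missing_sink_files("/opt/spark/app/data/whale_alerts"))
```

Running it in the driver container and in each worker container shows where, if anywhere, the referenced Parquet files actually exist.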