Re: PySpark AnalysisException: Ambiguous reference...

VikasM · yesterday

Thank you, balajij8, for your suggestion about enabling case-sensitive mode. It worked! The process now moves past the previous error, and Spark is successfully consuming data from Kafka.

However, it looks like I've run into another issue. Although the streaming job is consuming the data, it doesn't appear to be writing any Parquet files as expected.

I do see the checkpoint directories being created correctly, both inside the Spark container and on my local machine through the mounted volume, so it seems the streaming queries are running. The only thing missing is the Parquet output.

I'll investigate this next, but if you have any suggestions about what might cause Spark Structured Streaming to create checkpoints without writing any output files, I'd really appreciate your guidance.

Here are my two Parquet sinks:

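# ========================================================================
# WHALE ALERTS PARQUET SINK
# ========================================================================
whale_query = (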
    whale_df.writeStream
    .queryName("whale_alerts")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/whale_alerts"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/whale_alerts"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

# ========================================================================
# KLINE PARQUET SINK
# ========================================================================
kline_query = (
    kline_df.writeStream
    .queryName("candlestick_history")
    .format("parquet")
    .outputMode("append")
    .option(
        "path",
        "/opt/spark/app/data/candlesticks"
    )
    .option(
        "checkpointLocation",
        "/opt/spark/app/checkpoints/candlesticks"
    )
    .trigger(processingTime="10 seconds")
    .start()
)

print("🚀 Whale detection pipeline running")
print("🚀 Candlestick pipeline running")

spark.streams.awaitAnyTermination()
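
For debugging, a minimal sketch like the one below could run for a minute or so before the blocking call. It polls each query's status and last progress, which should show whether the micro-batches actually receive any rows. The helper names `summarize_progress` and `watch_queries` are hypothetical, and the sketch assumes `lastProgress` can be indexed like a dict, as it can in PySpark 3.x.

```python
import time


def summarize_progress(query):
    """Collect a few diagnostic fields from one StreamingQuery-like object."""
    progress = query.lastProgress  # None until the first micro-batch has completed
    error = query.exception()      # None while the query has not failed
    return {
        "name": query.name,
        "active": query.isActive,
        "num_input_rows": None if progress is None else progress["numInputRows"],
        "status": query.status["message"],
        "error": None if error is None else str(error),
    }


def watch_queries(queries, polls=6, interval_seconds=10, sleep=time.sleep):
    """Print one summary per query per poll and return the final summaries."""
    summaries = []
    for _ in range(polls):
        summaries = [summarize_progress(q) for q in queries]
        for summary in summaries:
            print(summary)
        sleep(interval_seconds)
    return summaries
```

In the script above, `watch_queries([whale_query, kline_query])` would go just before `spark.streams.awaitAnyTermination()`. If `num_input_rows` stays at 0 or `None`, the sink has nothing to write, which would point to the Kafka source or the upstream transformations rather than the Parquet writer itself.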

Thank you again for your help!