Re: PySpark AnalysisException: Ambiguous reference...

VikasM · yesterday

Thanks for your reply.

I investigated the output directories a bit further before trying another path. If my understanding is correct, the volume mount and read/write permissions do not seem to be the issue in my case. I think this because the checkpoint and data directories are continuously created and updated, and I can see this both inside the Docker container and on my local machine. The checkpoint files, offsets, commits, and _spark_metadata are all being written successfully, which suggests that Spark can write to the mounted volume.

What confuses me is that _spark_metadata contains entries such as:

{"path":"file:///opt/spark/app/data/whale_alerts/part-00000-ac552411-0fa6-47c8-b120-4dfcc9227b09-c000.snappy.parquet","size":1125,"isDir":false,"modificationTime":1782477948968,"blockReplication":1,"blockSize":33554432,"action":"add"}

which indicates that Spark believes a Parquet file was committed. However, when I search both inside the container and on the host, the referenced part-*.snappy.parquet files do not exist; only the _spark_metadata directory is present. Could this indicate an issue during the file commit phase rather than a volume mount or permission problem? If so, are there any Spark or Hadoop configurations that you would recommend checking next?
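
In case it helps to reproduce the mismatch, here is a rough sketch of the check I have in mind. It walks the _spark_metadata log and reports every "add" entry whose file is not on disk. It assumes each log file is a version header line followed by one JSON object per line, as in the entry above, and that the paths are local file:// URIs. The function names are my own, not part of any Spark API.

```python
import json
from pathlib import Path
from urllib.parse import urlparse, unquote


def committed_paths(metadata_dir):
    """Yield (log_file_name, local_path) for every 'add' entry in the sink log."""
    for log_file in sorted(Path(metadata_dir).iterdir()):
        # Skip subdirectories and hidden files such as .crc checksums.
        if not log_file.is_file() or log_file.name.startswith("."):
            continue
        for line in log_file.read_text(errors="replace").splitlines():
            line = line.strip()
            if not line.startswith("{"):
                continue  # assumed version header (e.g. "v1") or blank line
            entry = json.loads(line)
            if entry.get("action") != "add":
                continue
            # Assumption: paths are file:// URIs that resolve on this machine.
            yield log_file.name, Path(unquote(urlparse(entry["path"]).path))


def missing_files(metadata_dir):
    """Return the (log_file_name, path) pairs whose data file does not exist."""
    return [(name, p) for name, p in committed_paths(metadata_dir) if not p.exists()]
```

I would run this inside the container, since the logged paths are container paths, for example missing_files("/opt/spark/app/data/whale_alerts/_spark_metadata"). That directory is inferred from the entry above, so adjust it if the layout differs. If every logged file is reported missing, that would point to the commit or rename step rather than to the mount.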