yesterday - last edited yesterday
Hello balajij8,
Before trying your suggestions, I decided to inspect the filesystem inside my Spark container once more.
I found something that has changed my understanding of the problem. There are no errors being reported by the streaming job, and the checkpoint and _spark_metadata directories are being updated continuously. I also found metadata entries that indicate Spark believes it has successfully written Parquet files.
However, I cannot find the actual part-*.snappy.parquet files in the output directory, even though the metadata references them. For example:
$ cd _spark_metadata
$ ls
0 1 2 3
$ cat 1
v1
{"path":"file:///opt/spark/app/data/whale_alerts/part-00000-ac552411-0fa6-47c8-b120-4dfcc9227b09-c000.snappy.parquet","size":1125,"isDir":false,"modificationTime":1782477948968,"blockReplication":1,"blockSize":33554432,"action":"add"}
But when I run:
find /opt/spark/app/data -name "*.parquet"
no Parquet files are found, either inside the container or on my host machine. Only the _spark_metadata files exist.
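To make the mismatch concrete, a small script along these lines could cross-check every path recorded in _spark_metadata against the filesystem it runs on. This is only a sketch under a few assumptions: the log files consist of a version header line (v1) followed by one JSON object per line, as in the example above; the paths use the file:// scheme; and the helper names referenced_files and missing_files are hypothetical, not part of Spark.

```python
import json
import os
from urllib.parse import unquote, urlparse


def referenced_files(metadata_dir):
    """Yield (log_file_name, local_path) for each "add" entry in the sink log."""
    for name in sorted(os.listdir(metadata_dir)):
        full = os.path.join(metadata_dir, name)
        # Skip hidden files (for example checksum files) and subdirectories.
        if name.startswith(".") or not os.path.isfile(full):
            continue
        with open(full, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line.startswith("{"):
                    continue  # version header such as "v1", or a blank line
                entry = json.loads(line)
                if entry.get("action") != "add":
                    continue
                # Assumption: paths look like file:///abs/path; keep only the path part.
                yield name, unquote(urlparse(entry["path"]).path)


def missing_files(metadata_dir):
    """Return the (log_file_name, local_path) pairs whose file does not exist here."""
    return [
        (name, path)
        for name, path in referenced_files(metadata_dir)
        if not os.path.exists(path)
    ]
```

Calling missing_files("/opt/spark/app/data/whale_alerts/_spark_metadata") in the container where the find command was run (the directory is inferred from the path in the entry above) would list every file the sink log claims to have written that is not visible from that filesystem. Running the same check in any other container that takes part in the job would show whether the files exist somewhere else.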
Since the streaming job is processing records successfully and the metadata is being written, I'm now wondering whether this is related to the file sink, filesystem, or Docker volume configuration rather than the upstream pipeline.
Before I start changing the Kafka configuration or thresholds, do you have any thoughts on why Spark would generate metadata entries without the corresponding Parquet files?