03-14-2026 12:18 AM
Hi everyone,
I am a data engineer currently practicing Spark Structured Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.
My setup:
Source folder:
/Volumes/workspace/streaming/stream
I am reading JSON files using Spark Streaming with a predefined schema.
df = spark.readStream \
    .schema(schema) \
    .option("multiLine", "true") \
    .json("/Volumes/workspace/streaming/stream")
Then I perform some transformations and write the output to Delta:
transform_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .trigger(once=True) \
    .option("path", "/Volumes/workspace/streaming/stream/delta/Datasets") \
    .option("checkpointLocation", "/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \
    .start()
Scenario:
1. I uploaded a JSON file into the source folder.
2. Ran the stream → the data was processed successfully.
3. Then I tried to process the same file again, this time writing to a different Delta location with a new checkpoint.
However, when I ran the stream again, it reported:
Rows read = 0
Bytes written = 0
The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:
PATH_NOT_FOUND
My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.
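The expectation above can be made concrete with a toy model in plain Python. This is only a sketch of the mental model (processed files are remembered per checkpoint location, so a new checkpoint starts with an empty history); it is not Spark's implementation, and `ToyFileStream`, `run_once`, and the file name `events.json` are hypothetical names used for illustration.

```python
# Toy model only: NOT Spark's implementation. It mimics the assumption that
# "already processed" files are remembered per checkpoint location.
from typing import Dict, List, Set


class ToyFileStream:
    """Hypothetical stand-in for a file stream that tracks seen files per checkpoint."""

    def __init__(self) -> None:
        # checkpoint location -> set of file names already processed
        self._seen: Dict[str, Set[str]] = {}

    def run_once(self, source_files: List[str], checkpoint: str) -> List[str]:
        """Return the files that are new for this checkpoint and record them."""
        seen = self._seen.setdefault(checkpoint, set())
        new_files = [f for f in source_files if f not in seen]
        seen.update(new_files)
        return new_files


stream = ToyFileStream()
files = ["events.json"]  # hypothetical file name

first = stream.run_once(files, "checkpoint/new_checkpoint")
repeat = stream.run_once(files, "checkpoint/new_checkpoint")
fresh = stream.run_once(files, "checkpoint/other_checkpoint")

print(first)   # ['events.json']
print(repeat)  # []
print(fresh)   # ['events.json']
```

Under this model, the third run with a fresh checkpoint would pick the file up again, which is what I expected but not what the stream reported.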
My questions:
1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?
2. If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?
3. Is it recommended to keep the output folder outside the source folder when using file streaming?
4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?
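Regarding question 3, the paths from the setup can be checked for nesting with a few lines of plain Python. This sketch only compares path strings; `is_inside` is a hypothetical helper name, and the check says nothing about how Spark itself lists the source directory.

```python
from pathlib import PurePosixPath


def is_inside(child: str, parent: str) -> bool:
    """True if `child` equals `parent` or lies somewhere beneath it."""
    child_path = PurePosixPath(child)
    parent_path = PurePosixPath(parent)
    return child_path == parent_path or parent_path in child_path.parents


# Paths taken from the setup described in this post.
source = "/Volumes/workspace/streaming/stream"
output = "/Volumes/workspace/streaming/stream/delta/Datasets"
checkpoint = "/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint"

print(is_inside(output, source))      # True
print(is_inside(checkpoint, source))  # True
```

With the paths used here, both the Delta output and the checkpoint sit beneath the source folder.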
Any clarification would help me understand how Spark detects and processes files in this scenario.
Thanks!