03-14-2026 03:52 AM
This is expected behavior in Spark Structured Streaming, and the key point is that file streaming is not just driven by the checkpoint.
Spark uses file metadata tracking at the source level, not only checkpoint state, to decide whether a file is “new”.
Let me address your questions one by one.
Why this happens (important concept)
For file sources (readStream.format("json"), readStream.format("csv"), and so on), Spark tracks:
- file path
- file name
- file modification timestamp
Once a file is discovered by any streaming query, Spark treats it as already seen for that source path.
This detection is independent of the output path and not fully reset by using a new checkpoint.
So even with:
- a new checkpoint location
- a new output Delta path
Spark still sees no new files, because the files already exist in the source directory and have not changed.
That’s why you see:
- Rows read = 0
- output path not created
1. How to process the same input files again with modified transformations?
You have three correct options, depending on intent:
Option A (most common & recommended)
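Use batch processing (spark.read) instead of streaming (spark.readStream).
This is the cleanest approach when you want to reprocess historical or static files, because a batch read keeps no record of earlier runs and reads every file each time.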
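As a rough illustration of Option A, here is a minimal sketch of a batch re-run. The function name, the `transform` callable, the JSON input format and the Delta output format are all assumptions for the example, not details from your job; `spark` is expected to be an existing SparkSession.

```python
def reprocess_as_batch(spark, source_path, output_path, transform):
    """Read every file under source_path once, transform it, and overwrite output_path.

    Hypothetical helper: `transform` is any function taking and returning a DataFrame.
    """
    # spark.read (not spark.readStream) is a one-off batch read: it uses no
    # checkpoint, so every file under source_path is read on every run.
    raw_df = spark.read.format("json").load(source_path)  # "json" is an assumption

    # Apply the (possibly modified) transformation logic.
    result_df = transform(raw_df)

    # "overwrite" replaces the previous run's output, so re-runs do not duplicate rows.
    result_df.write.format("delta").mode("overwrite").save(output_path)
    return result_df


# Example call (hypothetical paths):
# reprocess_as_batch(spark, "/data/raw/events", "/data/out/events_v2",
#                    lambda df: df.dropDuplicates())
```

Because nothing is remembered between runs, you can change `transform` and call this again as often as you like.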
Option B
Move or copy the files into a new source directory and start a new stream from there.
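For Option B, a minimal sketch of the copy step is below. It assumes the files sit on a filesystem that plain Python can reach (a local or mounted path); on cloud object storage you would use your platform's own file utilities instead. The function name and paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_to_new_source(old_source, new_source):
    """Copy every regular file directly under old_source into new_source.

    Returns the names of the copied files in sorted order.
    """
    src, dst = Path(old_source), Path(new_source)
    dst.mkdir(parents=True, exist_ok=True)  # create the new source directory if needed

    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file():  # skip subdirectories
            shutil.copy(path, dst / path.name)
            copied.append(path.name)
    return copied


# Example call (hypothetical paths):
# copy_to_new_source("/mnt/raw/events", "/mnt/raw/events_rerun")
```

The new stream would then point at the new directory, with its own checkpoint and output paths.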
Option C
Touch or rewrite the files (change their modification time) so Spark treats them as new. This is not recommended in practice.
2. How to reprocess existing files using Spark Streaming?
Structured Streaming is not designed for replaying static files.
Streaming is meant for:
- append‑only data
- new files arriving over time
If you need replay / backfill / re‑runs:
- use batch jobs
- or read from Delta tables (not raw files)
A very common pattern is:
Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms
Delta supports time travel and replay; file streaming does not.
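To make the Delta side of that pattern concrete, here is a hedged sketch. It assumes Delta Lake is available on the cluster and that `spark` is an existing SparkSession; the function names and paths are hypothetical. The `versionAsOf` reader option and the `delta` streaming source are standard Delta Lake features.

```python
def read_bronze_version(spark, bronze_path, version):
    """Batch-read the Bronze table as it looked at a given table version (time travel)."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta time-travel option
        .load(bronze_path)
    )


def start_bronze_stream(spark, bronze_path, output_path, checkpoint_path):
    """Start a stream that reads the Bronze Delta table and writes to another Delta path."""
    bronze_stream = spark.readStream.format("delta").load(bronze_path)
    return (
        bronze_stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # keep outside source and output
        .start(output_path)
    )


# Example calls (hypothetical paths):
# old_df = read_bronze_version(spark, "/delta/bronze/events", 0)
# query = start_bronze_stream(spark, "/delta/bronze/events",
#                             "/delta/silver/events", "/chk/silver_events")
```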
3. Should the output folder be outside the source folder?
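Yes, this is strongly recommended.
Keeping output or checkpoint paths inside the source directory can cause:
- the source listing picking up output or checkpoint files as if they were input
- recursive reads, where one run's output becomes the next run's input
- results that are hard to predict or debug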
Best practice:
- Source path: input‑only
- Output path: separate directory
- Checkpoint path: separate directory
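If you want to guard against this before starting a query, a small pre-flight check along these lines may help. It is only a sketch: the function names are hypothetical, and it compares paths as plain strings (no symlink or mount resolution), so treat it as a sanity check rather than a guarantee.

```python
from pathlib import PurePosixPath


def is_inside(child, parent):
    """Return True if `child` is the same path as `parent` or nested under it."""
    child_path, parent_path = PurePosixPath(child), PurePosixPath(parent)
    return child_path == parent_path or parent_path in child_path.parents


def check_stream_paths(source, output, checkpoint):
    """Return a list of problems; an empty list means the three paths are separate."""
    problems = []
    for name, path in (("output", output), ("checkpoint", checkpoint)):
        if is_inside(path, source):
            problems.append(f"{name} path {path} is inside source path {source}")
    return problems


# Example call (hypothetical paths):
# check_stream_paths("/data/in", "/data/in/out", "/data/chk")
```

A non-empty result means the output or checkpoint location should be moved before the stream is started.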
4. If I change output path & checkpoint, how does Spark still know the file was processed?
Because file discovery happens before checkpointing.
Checkpoint stores:
- offsets
- progress
- execution state
But file streaming also relies on:
- directory listing
- file metadata
Since the files already exist and haven’t changed, Spark simply sees no new input, so the stream does nothing.
Key takeaway
Structured Streaming is for continuously arriving data.
It is not meant for reprocessing static files.
If the goal is reprocessing, experimentation, or changing transformations, use batch reads or Delta tables, not file streaming.
Hope this helps clarify how file streaming works internally.