Mridu
Databricks Partner

This is expected behavior in Spark Structured Streaming, and the key point is that file streaming is not just driven by the checkpoint.

Spark uses file metadata tracking at the source level, not only checkpoint state, to decide whether a file is “new”.

Let me address your questions one by one.


Why this happens (important concept)

For file sources (readStream.format("json"), readStream.format("csv"), and so on), Spark tracks:

  • file path
  • file name
  • file modification timestamp
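
To make those three attributes concrete, here is a minimal sketch that lists them for every file in a directory. It only illustrates what the metadata looks like; it is not how Spark lists files internally. The helper name describe_files is made up for this example, and it assumes the directory is reachable through the local file API.

```python
import os
from datetime import datetime, timezone


def describe_files(directory):
    """Return (path, name, modification time in UTC) for each file in directory."""
    rows = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_file():
            # st_mtime is the file's last-modified time in seconds since the epoch
            mtime = datetime.fromtimestamp(entry.stat().st_mtime, tz=timezone.utc)
            rows.append((entry.path, entry.name, mtime))
    return rows
```

Calling describe_files on the source directory before and after a run shows whether anything about the input files has actually changed.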

Once a file is discovered by any streaming query, Spark treats it as already seen for that source path.
This detection is independent of the output path and not fully reset by using a new checkpoint.

So even with:

  • a new checkpoint location 
  • a new output Delta path 

Spark still sees no new files, because the files already exist in the source directory and have not changed.

That’s why you see:

  • Rows read = 0
  • output path not created

1. How to process the same input files again with modified transformations?

You have three options, depending on what you want to achieve:

Option A (most common & recommended)
Use batch processing instead of streaming:

 
df = spark.read.schema(schema).json("/Volumes/workspace/streaming/stream")

This is the cleanest approach when you want to reprocess historical/static files.
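
A slightly fuller sketch of the batch approach, assuming you want the result written as a Delta table. The function name reprocess_batch, the transform argument, and the output path in the usage line are placeholders I made up; spark and schema are the ones from your notebook.

```python
def reprocess_batch(spark, schema, source_path, output_path, transform):
    """Read all files under source_path once, apply transform, write Delta."""
    # A batch read always sees every file in the directory, so there is
    # no "already processed" state to work around.
    df = spark.read.schema(schema).json(source_path)
    result = transform(df)
    # mode("overwrite") replaces the previous output, so every re-run
    # reflects the current version of the transformation.
    result.write.format("delta").mode("overwrite").save(output_path)
    return result


# Usage (output path and my_transform are hypothetical):
# reprocess_batch(spark, schema,
#                 "/Volumes/workspace/streaming/stream",
#                 "/Volumes/workspace/streaming/output_v2",
#                 my_transform)
```

Overwrite is a design choice here: when experimenting with transformations you usually want each run to replace the previous result rather than append to it.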

Option B
Move or copy the files into a new source directory and start a new stream from there.
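
A minimal sketch of the copy step, assuming the source and target directories are reachable through the local file API (on Databricks you could use dbutils.fs.cp instead). copy_to_fresh_source and the default *.json pattern are assumptions for this example.

```python
import shutil
from pathlib import Path


def copy_to_fresh_source(src_dir, dst_dir, pattern="*.json"):
    """Copy matching files into a new, empty directory and return their names."""
    src, dst = Path(src_dir), Path(dst_dir)
    # exist_ok=False raises FileExistsError if the target already exists,
    # so the replay directory is guaranteed to start out empty.
    dst.mkdir(parents=True, exist_ok=False)
    copied = []
    for f in sorted(src.glob(pattern)):
        if f.is_file():
            shutil.copy(f, dst / f.name)
            copied.append(f.name)
    return copied
```

After the copy, point readStream at the new directory and give the query its own checkpoint location.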

Option C
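Touch or rewrite the files (changing their modification time) so that Spark treats them as new.
This is not recommended in practice: it mutates the raw input, and whether a modified file is picked up again depends on the source and its configuration.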


2. How to reprocess existing files using Spark Streaming?

Structured Streaming is not designed for replaying static files.

Streaming is meant for:

  • append‑only data
  • new files arriving over time

If you need replay / backfill / re‑runs:

  • use batch jobs
  • or read from Delta tables (not raw files)

A very common pattern is:

Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms

Delta supports time travel and replay; file streaming does not.
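
The pattern can be sketched roughly as follows. The function names and the transform argument are made up for illustration, and trigger(availableNow=True) assumes a reasonably recent Spark version (3.3 or later, as far as I know). Treat it as a starting point, not a definitive implementation.

```python
def ingest_raw_to_bronze(spark, schema, raw_path, bronze_path):
    """Batch-load the raw JSON files into a Bronze Delta table."""
    raw = spark.read.schema(schema).json(raw_path)
    # Append keeps history, but running this twice on the same files
    # duplicates rows; use mode("overwrite") for a full reload instead.
    raw.write.format("delta").mode("append").save(bronze_path)


def stream_from_bronze(spark, bronze_path, output_path, checkpoint_path, transform):
    """Stream from the Bronze Delta table, transform, and write to Delta."""
    bronze = spark.readStream.format("delta").load(bronze_path)
    return (
        transform(bronze)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .start(output_path)
    )


def read_bronze_version(spark, bronze_path, version):
    """Time travel: read the Bronze table as of an earlier table version."""
    return spark.read.format("delta").option("versionAsOf", version).load(bronze_path)
```

With this layout, changing a transformation means rerunning stream_from_bronze with a new output path and a new checkpoint path, while the raw files are read only once.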


3. Should the output folder be outside the source folder?

Yes, this is strongly recommended.

Keeping output or checkpoint paths inside the source directory can cause:

  • the stream listing its own output or checkpoint files as input
  • recursive reads
  • behavior that is hard to reproduce and debug

Best practice:

  • Source path: input‑only
  • Output path: separate directory
  • Checkpoint path: separate directory
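
If you want a quick sanity check before starting a query, something like this sketch can flag a nested layout. It is a purely lexical comparison of POSIX-style paths (it does not resolve ".." or symlinks), and the helper names and the output and checkpoint paths in the usage lines are hypothetical.

```python
from pathlib import PurePosixPath


def is_inside(path, parent):
    """True if path equals parent or lies somewhere beneath it."""
    p, parent = PurePosixPath(path), PurePosixPath(parent)
    return p == parent or parent in p.parents


def layout_problems(source, output, checkpoint):
    """Return a list of human-readable problems with the path layout."""
    problems = []
    if is_inside(output, source):
        problems.append("output path is inside the source directory")
    if is_inside(checkpoint, source):
        problems.append("checkpoint path is inside the source directory")
    return problems


# Usage (output and checkpoint paths are hypothetical):
# layout_problems("/Volumes/workspace/streaming/stream",
#                 "/Volumes/workspace/streaming/output",
#                 "/Volumes/workspace/streaming/checkpoints/run1")
```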

4. If I change output path & checkpoint, how does Spark still know the file was processed?

Because file discovery happens before checkpointing.

Checkpoint stores:

  • offsets
  • progress
  • execution state

But file streaming also relies on:

  • directory listing
  • file metadata

Since the files already exist and haven’t changed, Spark simply sees no new input, so the stream does nothing.


Key takeaway

Structured Streaming is for continuously arriving data. It is not meant for reprocessing static files.

If the goal is reprocessing, experimentation, or changing transformations, use batch reads or Delta tables, not file streaming.

Hope this helps clarify how file streaming works internally.
