Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Streaming – Old file not processed with new checkpoint and new output path

RisabhRawat
New Contributor

Hi everyone,

I am a Data Engineer and currently practicing Spark Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.

My setup:

Source folder:
/Volumes/workspace/streaming/stream

I am reading JSON files using Spark Streaming with a predefined schema.

df = spark.readStream \
    .schema(schema) \
    .option("multiLine", "true") \
    .json("/Volumes/workspace/streaming/stream")

Then I perform some transformations and write the output to Delta:

transform_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .trigger(once=True) \
    .option("path", "/Volumes/workspace/streaming/stream/delta/Datasets") \
    .option("checkpointLocation", "/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \
    .start()

Scenario:

1. I uploaded a JSON file into the source folder.
2. Ran the stream → the data was processed successfully.
3. Then I tried to process the same file again but write it to a different Delta location with a new checkpoint.

However, when I run the stream again it shows:

Rows read = 0
Bytes written = 0

The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:

PATH_NOT_FOUND

My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.

My questions:

1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?
2. If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?
3. Is it recommended to keep the output folder outside the source folder when using file streaming?
4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?

Any clarification would help me understand how Spark detects and processes files in this scenario.

Thanks!

1 ACCEPTED SOLUTION

Mridu
New Contributor II

This is expected behavior in Spark Structured Streaming, and the key point is that file streaming is not just driven by the checkpoint.

Spark uses file metadata tracking at the source level, not only checkpoint state, to decide whether a file is “new”.

Let me address your questions one by one.


Why this happens (important concept)

For file sources (readStream.format("json"/"csv"/etc.)), Spark tracks:

  • file path
  • file name
  • file modification timestamp

Once a file is discovered by any streaming query, Spark treats it as already seen for that source path.
This detection is independent of the output path and not fully reset by using a new checkpoint.

So even with:

  • a new checkpoint location 
  • a new output Delta path 

Spark still sees no new files, because the files already exist in the source directory and have not changed.

That’s why you see (this is also visible in the query’s progress metrics, as sketched below):

  • Rows read = 0
  • output path not created
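
To check this from code rather than from the Spark UI, the streaming query's progress metrics report the same numbers. A minimal sketch, assuming query is the handle returned by start() and that output_path and checkpoint_path are placeholder variables:

query = transform_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .trigger(once=True) \
    .option("checkpointLocation", checkpoint_path) \
    .start(output_path)

query.awaitTermination()   # trigger(once=True) stops after one micro-batch

# each progress entry reports the rows read in that micro-batch;
# when no new files are discovered, numInputRows stays at 0
for p in query.recentProgress:
    print(p["numInputRows"])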

1. How to process the same input files again with modified transformations?

You have three correct options, depending on intent:

Option A (most common & recommended)
Use batch processing instead of streaming:

 
df = spark.read.schema(schema).option("multiLine", "true").json("/Volumes/workspace/streaming/stream")

This is the cleanest approach when you want to reprocess historical/static files.
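
If you also want to persist the reprocessed result, the write side is a plain batch write. A rough sketch, continuing from the df above (the transformation step and the output path are placeholders, kept outside the source folder):

# apply the modified transformations here (placeholder)
transform_df = df

# write the reprocessed result to a new Delta location; overwrite keeps reruns idempotent
transform_df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("/Volumes/workspace/streaming/output/datasets_batch")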

Option B
Move or copy the files into a new source directory and start a new stream from there.
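
A sketch of Option B on Databricks, assuming a hypothetical new source folder /Volumes/workspace/streaming/stream_v2 (copy only the input files, not any output or checkpoint subfolders that may live under the old source):

# copy only the JSON input files into a fresh source directory
for f in dbutils.fs.ls("/Volumes/workspace/streaming/stream"):
    if f.name.endswith(".json"):
        dbutils.fs.cp(f.path, "/Volumes/workspace/streaming/stream_v2/" + f.name)

# a new stream pointed at the new folder will discover the copies as new files
df_v2 = spark.readStream \
    .schema(schema) \
    .option("multiLine", "true") \
    .json("/Volumes/workspace/streaming/stream_v2")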

Option C
Touch or rewrite the files (change modification time) so Spark treats them as new
(not recommended in practice).


2. How to reprocess existing files using Spark Streaming?

Structured Streaming is not designed for replaying static files.

Streaming is meant for:

  • append‑only data
  • new files arriving over time

If you need replay / backfill / re‑runs:

  • use batch jobs
  • or read from Delta tables (not raw files)

A very common pattern is:

Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms

Delta supports time travel and replay; file streaming does not.
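
A rough sketch of that pattern, assuming a hypothetical bronze location /Volumes/workspace/streaming/bronze (names and paths are illustrative):

# 1) batch-ingest the raw JSON files into a bronze Delta table
raw_df = spark.read \
    .schema(schema) \
    .option("multiLine", "true") \
    .json("/Volumes/workspace/streaming/stream")

raw_df.write.format("delta").mode("append").save("/Volumes/workspace/streaming/bronze")

# 2) downstream jobs read from the bronze Delta table instead of raw files;
#    a Delta source can be replayed (e.g. via startingVersion) or re-read in batch
bronze_stream = spark.readStream.format("delta").load("/Volumes/workspace/streaming/bronze")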


3. Should the output folder be outside the source folder?

Yes — absolutely recommended.

Keeping output or checkpoint paths inside the source directory can cause:

  • file discovery confusion
  • recursive reads
  • unexpected behavior

Best practice (see the sketch after this list):

  • Source path: input‑only
  • Output path: separate directory
  • Checkpoint path: separate directory
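
A sketch with the three locations kept apart (all paths are illustrative; the only point is that the output and checkpoint no longer live under the input folder):

input_path = "/Volumes/workspace/streaming/stream"                     # input-only
output_path = "/Volumes/workspace/streaming/output/datasets"           # separate output
checkpoint_path = "/Volumes/workspace/streaming/checkpoints/datasets"  # separate checkpoint

df = spark.readStream \
    .schema(schema) \
    .option("multiLine", "true") \
    .json(input_path)

transform_df = df   # placeholder for your transformations

transform_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .trigger(once=True) \
    .option("checkpointLocation", checkpoint_path) \
    .start(output_path)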

4. If I change output path & checkpoint, how does Spark still know the file was processed?

Because file discovery happens before checkpointing.

Checkpoint stores:

  • offsets
  • progress
  • execution state

But file streaming also relies on:

  • directory listing
  • file metadata

Since the files already exist and haven’t changed, Spark simply sees no new input, so the stream does nothing.


Key takeaway

Structured Streaming is for continuously arriving data.
It is not meant for reprocessing static files.

If the goal is reprocessing, experimentation, or transformation changes, use batch reads or Delta tables, not file streaming.

Hope this helps clarify how file streaming works internally.

2 REPLIES

balajij8
Contributor

The checkpoint tracks the streaming query's progress, including state information and which input has already been processed. When you switch to a new checkpoint location, the next run starts fresh.

You can write a different Delta output with a new checkpoint and a new output location while reading from the same source; changing the source is not required. Keep the source, output, and checkpoint folders completely separate.
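
For example (paths are illustrative and kept outside the source folder), only the sink side needs to change; the streaming read can stay exactly as in the original question:

# same readStream as before; only the output and checkpoint locations are new
transform_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .trigger(once=True) \
    .option("path", "/Volumes/workspace/streaming/output/datasets_v2") \
    .option("checkpointLocation", "/Volumes/workspace/streaming/checkpoints/v2") \
    .start()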
