2 weeks ago
Hi everyone,
I am a Data Engineer and currently practicing Spark Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.
My setup:
Source folder:
/Volumes/workspace/streaming/stream
I am reading JSON files using Spark Streaming with a predefined schema.
df = spark.readStream \
.schema(schema) \
.option("multiLine","true") \
.json("/Volumes/workspace/streaming/stream")
Then I perform some transformations and write the output to Delta:
transform_df.writeStream \
.format("delta") \
.outputMode("append") \
.trigger(once=True) \
.option("path","/Volumes/workspace/streaming/stream/delta/Datasets") \
.option("checkpointLocation","/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \
.start()
Scenario:
1. I uploaded a JSON file into the source folder.
2. Ran the stream → the data was processed successfully.
3. Then I tried to process the same file again but write it to a different Delta location with a new checkpoint.
However, when I run the stream again it shows:
Rows read = 0
Bytes written = 0
The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:
PATH_NOT_FOUND
My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.
My questions:
1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?
2. If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?
3. Is it recommended to keep the output folder outside the source folder when using file streaming?
4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?
Any clarification would help me understand how Spark detects and processes files in this scenario.
Thanks!
2 weeks ago
This is expected behavior in Spark Structured Streaming, and the key point is that file streaming is not just driven by the checkpoint.
Spark uses file metadata tracking at the source level, not only checkpoint state, to decide whether a file is “new”.
Let me address your questions one by one.
For file sources (readStream.format("json"/"csv"/etc.), Spark tracks which files it has already seen, recording the path (and modification time) of every file it discovers.
Once a file is discovered by a streaming query, Spark treats it as already seen for that source path.
This detection is independent of the output path and, in this setup, is not reset simply by pointing the writer at a new location.
So even with a new checkpoint location and a new output path, Spark still sees no new files, because the files already exist in the source directory and have not changed.
That is why you see Rows read = 0 and Bytes written = 0.
You have three correct options, depending on intent:
Option A (most common & recommended)
Use batch processing instead of streaming: read the folder with spark.read rather than spark.readStream.
This is the cleanest approach when you want to reprocess historical/static files.
Option B
Move or copy the files into a new source directory and start a new stream from there.
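A small sketch of Option B. On Databricks Volumes you would use dbutils.fs.cp; plain shutil is used here so the sketch runs anywhere, and all paths are illustrative:

```python
# Sketch of Option B: copy already-processed files into a brand-new
# source directory; a new stream pointed there (with a fresh checkpoint)
# will treat the copies as never-seen files.
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
old_source = os.path.join(root, "stream")         # original landing folder
new_source = os.path.join(root, "stream_replay")  # new source for the replay stream
os.makedirs(old_source)

# Stand-in for the JSON file the original stream already consumed.
with open(os.path.join(old_source, "events.json"), "w") as f:
    f.write('{"id": 1}\n')

# Copy (not move) so the original landing folder stays untouched.
shutil.copytree(old_source, new_source)

print(sorted(os.listdir(new_source)))  # ['events.json']
```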
Option C
Touch or rewrite the files (change the modification time) so Spark treats them as new. Not recommended in practice.
Structured Streaming is not designed for replaying static files; it is meant for continuously arriving data.
If you need replay / backfill / re-runs, land the data in Delta first. A very common pattern is:
Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms
Delta supports time travel and replay; file streaming does not.
Yes, absolutely recommended.
Keeping output or checkpoint paths inside the source directory can cause the stream to discover its own output files as new input.
Best practice: keep the source, checkpoint, and output directories completely separate.
Because file discovery happens before checkpointing.
The checkpoint stores query progress and state, but the file source also relies on its own record of which files in the source directory have already been discovered.
Since the files already exist and haven’t changed, Spark simply sees no new input, so the stream does nothing.
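To make the "seen files" idea concrete, here is a Spark-free sketch of how a file streaming source decides what counts as new: it keeps a log of files it has already picked up (path plus modification time), and each micro-batch only takes entries missing from that log. The function and variable names are illustrative, not Spark internals:

```python
# Minimal simulation of file-source discovery: the same file is only
# "new" the first time it is seen; unchanged files are skipped after that.
import os
import tempfile

def discover_new_files(source_dir, seen):
    """Return files not yet in `seen`, then record them as seen."""
    new_files = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        key = (path, os.path.getmtime(path))
        if key not in seen:
            seen.add(key)
            new_files.append(path)
    return new_files

src = tempfile.mkdtemp()
with open(os.path.join(src, "a.json"), "w") as f:
    f.write("{}")

seen = set()                            # plays the role of the source's file log
first = discover_new_files(src, seen)   # first pass: file is discovered
second = discover_new_files(src, seen)  # second pass: same file, nothing new
print(len(first), len(second))  # 1 0
```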
Structured Streaming is for continuously arriving data; it is not meant for reprocessing static files.
If the goal is reprocessing, experimentation, or transformation changes, use batch reads or Delta tables, not file streaming.
Hope this helps clarify how file streaming works internally.
a week ago
The checkpoint tracks the structured streaming progress, including state information and which input files have been processed. When you switch to a new checkpoint location, the next run begins fresh.
You can create a different Delta file with a new checkpoint & new output location using the same source location. Source change is not required. Keep the output, source & checkpoint folders completely separate.