2 weeks ago
Hi everyone,
I am a Data Engineer and currently practicing Spark Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.
My setup:
Source folder:
/Volumes/workspace/streaming/stream
I am reading JSON files using Spark Streaming with a predefined schema.
df = spark.readStream \
.schema(schema) \
.option("multiLine","true") \
.json("/Volumes/workspace/streaming/stream")
Then I perform some transformations and write the output to Delta:
transform_df.writeStream \
.format("delta") \
.outputMode("append") \
.trigger(once=True) \
.option("path","/Volumes/workspace/streaming/stream/delta/Datasets") \
.option("checkpointLocation","/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \
.start()
Scenario:
1. I uploaded a JSON file into the source folder.
2. Ran the stream → the data was processed successfully.
3. Then I tried to process the same file again but write it to a different Delta location with a new checkpoint.
However, when I run the stream again it shows:
Rows read = 0
Bytes written = 0
The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:
PATH_NOT_FOUND
My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.
My questions:
1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?
2. If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?
3. Is it recommended to keep the output folder outside the source folder when using file streaming?
4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?
Any clarification would help me understand how Spark detects and processes files in this scenario.
Thanks!
2 weeks ago
This is expected behavior in Spark Structured Streaming, and the key point is that file streaming is not just driven by the checkpoint.
Spark uses file metadata tracking at the source level, not only checkpoint state, to decide whether a file is “new”.
Let me address your questions one by one.
For file sources (readStream.format("json"/"csv"/etc.), Spark tracks which files it has already seen, recording the path (and modification time) of every file it discovers.
Once a file is discovered by a streaming query, Spark treats it as already seen for that source path.
This detection is independent of the output path and, in this setup, is not reset simply by pointing the writer at a new location.
So even with a new checkpoint location and a new output path, Spark still sees no new files, because the files already exist in the source directory and have not changed.
That is why you see Rows read = 0 and Bytes written = 0.
You have three correct options, depending on intent:
Option A (most common & recommended)
Use batch processing instead of streaming: read the folder with spark.read rather than spark.readStream.
This is the cleanest approach when you want to reprocess historical/static files.
Option B
Move or copy the files into a new source directory and start a new stream from there.
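A small sketch of Option B. On Databricks Volumes you would use dbutils.fs.cp; plain shutil is used here so the sketch runs anywhere, and all paths are illustrative:

```python
# Sketch of Option B: copy already-processed files into a brand-new
# source directory; a new stream pointed there (with a fresh checkpoint)
# will treat the copies as never-seen files.
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
old_source = os.path.join(root, "stream")         # original landing folder
new_source = os.path.join(root, "stream_replay")  # new source for the replay stream
os.makedirs(old_source)

# Stand-in for the JSON file the original stream already consumed.
with open(os.path.join(old_source, "events.json"), "w") as f:
    f.write('{"id": 1}\n')

# Copy (not move) so the original landing folder stays untouched.
shutil.copytree(old_source, new_source)

print(sorted(os.listdir(new_source)))  # ['events.json']
```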
Option C
Touch or rewrite the files (change the modification time) so Spark treats them as new. Not recommended in practice.
Structured Streaming is not designed for replaying static files; it is meant for continuously arriving data.
If you need replay / backfill / re-runs, land the data in Delta first. A very common pattern is:
Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms
Delta supports time travel and replay; file streaming does not.
Yes, absolutely recommended.
Keeping output or checkpoint paths inside the source directory can cause the stream to discover its own output files as new input.
Best practice: keep the source, checkpoint, and output directories completely separate.
Because file discovery happens before checkpointing.
The checkpoint stores query progress and state, but the file source also relies on its own record of which files in the source directory have already been discovered.
Since the files already exist and haven’t changed, Spark simply sees no new input, so the stream does nothing.
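To make the "seen files" idea concrete, here is a Spark-free sketch of how a file streaming source decides what counts as new: it keeps a log of files it has already picked up (path plus modification time), and each micro-batch only takes entries missing from that log. The function and variable names are illustrative, not Spark internals:

```python
# Minimal simulation of file-source discovery: the same file is only
# "new" the first time it is seen; unchanged files are skipped after that.
import os
import tempfile

def discover_new_files(source_dir, seen):
    """Return files not yet in `seen`, then record them as seen."""
    new_files = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        key = (path, os.path.getmtime(path))
        if key not in seen:
            seen.add(key)
            new_files.append(path)
    return new_files

src = tempfile.mkdtemp()
with open(os.path.join(src, "a.json"), "w") as f:
    f.write("{}")

seen = set()                            # plays the role of the source's file log
first = discover_new_files(src, seen)   # first pass: file is discovered
second = discover_new_files(src, seen)  # second pass: same file, nothing new
print(len(first), len(second))  # 1 0
```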
Structured Streaming is for continuously arriving data; it is not meant for reprocessing static files.
If the goal is reprocessing, experimentation, or transformation changes, use batch reads or Delta tables, not file streaming.
Hope this helps clarify how file streaming works internally.
a week ago
The checkpoint tracks the structured streaming progress, including state information and which input files have been processed. When you switch to a new checkpoint location, the next run begins fresh.
You can create a different Delta file with a new checkpoint & new output location using the same source location. Source change is not required. Keep the output, source & checkpoint folders completely separate.