<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Spark Streaming – Old file not processed with new checkpoint and new output path in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150881#M53541</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I am a Data Engineer and currently practicing Spark Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.&lt;/P&gt;&lt;P&gt;My setup:&lt;/P&gt;&lt;P&gt;Source folder:&lt;BR /&gt;/Volumes/workspace/streaming/stream&lt;/P&gt;&lt;P&gt;I am reading JSON files using Spark Streaming with a predefined schema.&lt;/P&gt;&lt;P&gt;df = spark.readStream \&lt;BR /&gt;.schema(schema) \&lt;BR /&gt;.option("multiLine","true") \&lt;BR /&gt;.json("/Volumes/workspace/streaming/stream")&lt;/P&gt;&lt;P&gt;Then I perform some transformations and write the output to Delta:&lt;/P&gt;&lt;P&gt;transform_df.writeStream \&lt;BR /&gt;.format("delta") \&lt;BR /&gt;.outputMode("append") \&lt;BR /&gt;.trigger(once=True) \&lt;BR /&gt;.option("path","/Volumes/workspace/streaming/stream/delta/Datasets") \&lt;BR /&gt;.option("checkpointLocation","/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \&lt;BR /&gt;.start()&lt;/P&gt;&lt;P&gt;Scenario:&lt;/P&gt;&lt;P&gt;1. I uploaded a JSON file into the source folder.&lt;BR /&gt;2. Ran the stream → the data was processed successfully.&lt;BR /&gt;3. Then I tried to process the same file again but write it to a different Delta location with a new checkpoint.&lt;/P&gt;&lt;P&gt;However, when I run the stream again it shows:&lt;/P&gt;&lt;P&gt;Rows read = 0&lt;BR /&gt;Bytes written = 0&lt;/P&gt;&lt;P&gt;The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:&lt;/P&gt;&lt;P&gt;PATH_NOT_FOUND&lt;/P&gt;&lt;P&gt;My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.&lt;/P&gt;&lt;P&gt;My questions:&lt;/P&gt;&lt;P&gt;1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?&lt;BR /&gt;2. 
If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?&lt;BR /&gt;3. Is it recommended to keep the output folder outside the source folder when using file streaming?&lt;BR /&gt;4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?&lt;/P&gt;&lt;P&gt;Any clarification would help me understand how Spark detects and processes files in this scenario.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Sat, 14 Mar 2026 07:18:34 GMT</pubDate>
    <dc:creator>RisabhRawat</dc:creator>
    <dc:date>2026-03-14T07:18:34Z</dc:date>
    <item>
      <title>Spark Streaming – Old file not processed with new checkpoint and new output path</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150881#M53541</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I am a Data Engineer and currently practicing Spark Streaming in Databricks. I am trying to understand how file streaming behaves with checkpoints and how Spark detects new files.&lt;/P&gt;&lt;P&gt;My setup:&lt;/P&gt;&lt;P&gt;Source folder:&lt;BR /&gt;/Volumes/workspace/streaming/stream&lt;/P&gt;&lt;P&gt;I am reading JSON files using Spark Streaming with a predefined schema.&lt;/P&gt;&lt;P&gt;df = spark.readStream \&lt;BR /&gt;.schema(schema) \&lt;BR /&gt;.option("multiLine","true") \&lt;BR /&gt;.json("/Volumes/workspace/streaming/stream")&lt;/P&gt;&lt;P&gt;Then I perform some transformations and write the output to Delta:&lt;/P&gt;&lt;P&gt;transform_df.writeStream \&lt;BR /&gt;.format("delta") \&lt;BR /&gt;.outputMode("append") \&lt;BR /&gt;.trigger(once=True) \&lt;BR /&gt;.option("path","/Volumes/workspace/streaming/stream/delta/Datasets") \&lt;BR /&gt;.option("checkpointLocation","/Volumes/workspace/streaming/stream/checkpoint/new_checkpoint") \&lt;BR /&gt;.start()&lt;/P&gt;&lt;P&gt;Scenario:&lt;/P&gt;&lt;P&gt;1. I uploaded a JSON file into the source folder.&lt;BR /&gt;2. Ran the stream → the data was processed successfully.&lt;BR /&gt;3. Then I tried to process the same file again but write it to a different Delta location with a new checkpoint.&lt;/P&gt;&lt;P&gt;However, when I run the stream again it shows:&lt;/P&gt;&lt;P&gt;Rows read = 0&lt;BR /&gt;Bytes written = 0&lt;/P&gt;&lt;P&gt;The checkpoint folder gets created, but the output Delta folder is not created. When I try to query it, I get:&lt;/P&gt;&lt;P&gt;PATH_NOT_FOUND&lt;/P&gt;&lt;P&gt;My expectation was that since I am using a new checkpoint and a new output path, Spark should process the existing file again.&lt;/P&gt;&lt;P&gt;My questions:&lt;/P&gt;&lt;P&gt;1. If someone wants to process the same input files again and store them in a different location with modified transformations, what is the correct approach?&lt;BR /&gt;2. 
If I want to reprocess existing files using Spark Streaming, how should I configure the checkpoint or source?&lt;BR /&gt;3. Is it recommended to keep the output folder outside the source folder when using file streaming?&lt;BR /&gt;4. If I change the output path and create a new checkpoint location, how does Spark still know that the file has already been processed?&lt;/P&gt;&lt;P&gt;Any clarification would help me understand how Spark detects and processes files in this scenario.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 14 Mar 2026 07:18:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150881#M53541</guid>
      <dc:creator>RisabhRawat</dc:creator>
      <dc:date>2026-03-14T07:18:34Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming – Old file not processed with new checkpoint and new output path</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150886#M53542</link>
      <description>&lt;DIV&gt;&lt;P&gt;This is expected behavior in Spark Structured Streaming, and the key point is that &lt;STRONG&gt;file streaming is not just driven by the checkpoint&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Spark uses &lt;STRONG&gt;file metadata tracking at the source level&lt;/STRONG&gt;, not only checkpoint state, to decide whether a file is “new”.&lt;/P&gt;&lt;P&gt;Let me address your questions one by one.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;Why this happens (important concept)&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;For file sources (readStream.format("json"/"csv"/etc.)), Spark tracks:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;file path&lt;/LI&gt;&lt;LI&gt;file name&lt;/LI&gt;&lt;LI&gt;file modification timestamp&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Once a file is discovered by &lt;STRONG&gt;any streaming query&lt;/STRONG&gt;, Spark treats it as &lt;STRONG&gt;already seen&lt;/STRONG&gt; for that source path.&lt;BR /&gt;This detection is &lt;STRONG&gt;independent of the output path&lt;/STRONG&gt; and &lt;STRONG&gt;not fully reset by using a new checkpoint&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;So even with:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;a new checkpoint location&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;a new output Delta path&amp;nbsp;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Spark still sees &lt;STRONG&gt;no new files&lt;/STRONG&gt;, because the files already exist in the source directory and have not changed.&lt;/P&gt;&lt;P&gt;That’s why you see:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Rows read = 0&lt;/LI&gt;&lt;LI&gt;output path not created&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;1. 
How to process the same input files again with modified transformations?&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;You have &lt;STRONG&gt;three options&lt;/STRONG&gt;, depending on intent:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option A (most common &amp;amp; recommended)&lt;/STRONG&gt;&lt;BR /&gt;Use &lt;STRONG&gt;batch processing&lt;/STRONG&gt; instead of streaming:&lt;/P&gt;&lt;PRE&gt;df = spark.read.schema(schema).option("multiLine", "true").json("/Volumes/workspace/streaming/stream")&lt;/PRE&gt;&lt;P&gt;This is the cleanest approach when you want to reprocess historical or static files. The multiLine option is carried over from your streaming read, since the same files are being parsed.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option B&lt;/STRONG&gt;&lt;BR /&gt;Move or copy the files into a &lt;STRONG&gt;new source directory&lt;/STRONG&gt; and start a new stream from there.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option C&lt;/STRONG&gt;&lt;BR /&gt;Touch or rewrite the files (change their modification time) so Spark treats them as new&lt;BR /&gt;(not recommended in practice).&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;2. 
How to reprocess existing files using Spark Streaming?&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;Structured Streaming &lt;STRONG&gt;is not designed for replaying static files&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Streaming is meant for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;append‑only data&lt;/LI&gt;&lt;LI&gt;new files arriving over time&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you need replay / backfill / re‑runs:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;use &lt;STRONG&gt;batch jobs&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;or read from &lt;STRONG&gt;Delta tables&lt;/STRONG&gt; (not raw files)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;A very common pattern is:&lt;/P&gt;&lt;PRE&gt;Raw files → Batch ingest → Bronze Delta
Bronze Delta → Streaming / Batch transforms&lt;/PRE&gt;&lt;P&gt;Delta supports time travel and replay; file streaming does not.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;3. Should the output folder be outside the source folder?&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Yes, this is strongly recommended.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Keeping output or checkpoint paths &lt;STRONG&gt;inside the source directory&lt;/STRONG&gt; can cause:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;file discovery confusion&lt;/LI&gt;&lt;LI&gt;recursive reads&lt;/LI&gt;&lt;LI&gt;unexpected behavior&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In your setup, both the Delta output and the checkpoint sit under /Volumes/workspace/streaming/stream, which is also the source path.&lt;/P&gt;&lt;P&gt;Best practice:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Source path: input-only&lt;/LI&gt;&lt;LI&gt;Output path: separate directory&lt;/LI&gt;&lt;LI&gt;Checkpoint path: separate directory&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;4. If I change output path &amp;amp; checkpoint, how does Spark still know the file was processed?&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;Because &lt;STRONG&gt;file discovery happens before checkpointing&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Checkpoint stores:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;offsets&lt;/LI&gt;&lt;LI&gt;progress&lt;/LI&gt;&lt;LI&gt;execution state&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;But file streaming also relies on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;directory listing&lt;/LI&gt;&lt;LI&gt;file metadata&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Since the files already exist and haven’t changed, Spark simply sees &lt;STRONG&gt;no new input&lt;/STRONG&gt;, so the stream does nothing.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;&lt;STRONG&gt;Key takeaway&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Structured Streaming is for continuously arriving data.&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;It is not meant for reprocessing static files.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;If the goal is reprocessing, experimentation, or transformation changes, &lt;STRONG&gt;use batch reads or Delta tables&lt;/STRONG&gt;, not file streaming.&lt;/P&gt;&lt;P&gt;Hope this helps clarify how file streaming works internally.&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 14 Mar 2026 10:52:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150886#M53542</guid>
      <dc:creator>Mridu</dc:creator>
      <dc:date>2026-03-14T10:52:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming – Old file not processed with new checkpoint and new output path</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150927#M53543</link>
      <description>&lt;P&gt;The checkpoint tracks the Structured Streaming query's progress, including state information and which input has already been processed. When you switch to a new checkpoint location, the next run starts fresh.&lt;/P&gt;&lt;P&gt;You can write a different Delta table with a new checkpoint and a new output location while reading from the same source location; changing the source is not required. Keep the output, source, and checkpoint folders completely separate.&lt;/P&gt;&lt;P&gt;More details &lt;A href="https://docs.databricks.com/aws/en/structured-streaming/checkpoints" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Sat, 14 Mar 2026 16:38:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-old-file-not-processed-with-new-checkpoint-and/m-p/150927#M53543</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-03-14T16:38:08Z</dc:date>
    </item>
  </channel>
</rss>

