<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark Streaming - only process new files in streaming path? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21324#M14524</link>
    <description>&lt;P&gt;Yes, exactly: &lt;B&gt;cloudFiles.maxFileAge&lt;/B&gt;. Please select your answer as the best one &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 06 May 2022 15:46:47 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-05-06T15:46:47Z</dc:date>
    <item>
      <title>Spark Streaming - only process new files in streaming path?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21322#M14522</link>
      <description>&lt;P&gt;In our streaming jobs, we currently run a stream (cloudFiles format) on a directory that receives sales transactions every 5 minutes.&lt;/P&gt;&lt;P&gt;Within this directory, the transactions are organized in the following layout:&lt;/P&gt;&lt;P&gt;&amp;lt;streaming-checkpoint-root&amp;gt;/&amp;lt;transaction_date&amp;gt;/&amp;lt;transaction_hour&amp;gt;/transaction_x_y.json&lt;/P&gt;&lt;P&gt;Only TODAY's transactions are of interest; all others are already obsolete.&lt;/P&gt;&lt;P&gt;When I start the streaming job, it processes all the historical transactions, which I don't want.&lt;/P&gt;&lt;P&gt;Is it possible to process only NEW files that arrive after the streaming job has started?&lt;/P&gt;</description>
      <pubDate>Fri, 06 May 2022 11:19:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21322#M14522</guid>
      <dc:creator>Michael_Galli</dc:creator>
      <dc:date>2022-05-06T11:19:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - only process new files in streaming path?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21323#M14523</link>
      <description>&lt;P&gt;Seems that "maxFileAge" solves the problem.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;streaming_df = (&lt;/P&gt;&lt;P&gt;    spark.readStream.format("cloudFiles").option("cloudFiles.format", "json") \&lt;/P&gt;&lt;P&gt;        .option("maxFilesPerTrigger", 20) \&lt;/P&gt;&lt;P&gt;        .option("multiLine", True) \&lt;/P&gt;&lt;P&gt;        .option("maxFileAge", 1) \&lt;/P&gt;&lt;P&gt;        .schema(schema).load(streaming_path)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This ignores files older than 1 week.&lt;/P&gt;&lt;P&gt;But how to ignore files older than 1 day?&lt;/P&gt;</description>
      <pubDate>Fri, 06 May 2022 13:28:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21323#M14523</guid>
      <dc:creator>Michael_Galli</dc:creator>
      <dc:date>2022-05-06T13:28:33Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - only process new files in streaming path?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21324#M14524</link>
      <description>&lt;P&gt;Yes, exactly: &lt;B&gt;cloudFiles.maxFileAge&lt;/B&gt;. Please select your answer as the best one &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 06 May 2022 15:46:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21324#M14524</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-05-06T15:46:47Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - only process new files in streaming path?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21325#M14525</link>
      <description>&lt;P&gt;Update:&lt;/P&gt;&lt;P&gt;It seems that maxFileAge was not a good idea. The following, with the option "cloudFiles.includeExistingFiles" set to False, solved my problem:&lt;/P&gt;&lt;P&gt;streaming_df = (&lt;/P&gt;&lt;P&gt;    spark.readStream.format("cloudFiles")&lt;/P&gt;&lt;P&gt;        .option("cloudFiles.format", extension)&lt;/P&gt;&lt;P&gt;        .option("cloudFiles.maxFilesPerTrigger", 20)&lt;/P&gt;&lt;P&gt;        .option("cloudFiles.includeExistingFiles", False)&lt;/P&gt;&lt;P&gt;        .option("multiLine", True)&lt;/P&gt;&lt;P&gt;        .option("pathGlobFilter", "*." + extension)&lt;/P&gt;&lt;P&gt;        .schema(schema).load(streaming_path)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;</description>
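      <!--
Editor's sketch, not part of the original post: the reader options from the reply above can be gathered into one dictionary and passed in a single call. The helper name autoloader_options and its parameters are hypothetical, introduced only for illustration. As far as I understand the Auto Loader documentation, cloudFiles.includeExistingFiles is only evaluated the first time a stream starts against a fresh checkpoint, so treat this as a starting point under that assumption.

```python
from typing import Dict


def autoloader_options(extension: str, max_files_per_trigger: int = 20) -> Dict[str, str]:
    """Collect the reader options used in the post into one dict (hypothetical helper)."""
    return {
        "cloudFiles.format": extension,
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_trigger),
        # Skip files that already exist in the path when the stream first starts.
        "cloudFiles.includeExistingFiles": "false",
        "multiLine": "true",
        "pathGlobFilter": "*." + extension,
    }


options = autoloader_options("json")

# Assumed usage on a Databricks cluster (not executed here; spark, schema and
# streaming_path come from the surrounding notebook):
# streaming_df = (
#     spark.readStream.format("cloudFiles")
#     .options(**options)
#     .schema(schema)
#     .load(streaming_path)
# )
```

Keeping the options in a plain dict makes them easy to inspect or log before the stream starts.
      -->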
      <pubDate>Tue, 10 May 2022 06:00:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-only-process-new-files-in-streaming-path/m-p/21325#M14525</guid>
      <dc:creator>Michael_Galli</dc:creator>
      <dc:date>2022-05-10T06:00:26Z</dc:date>
    </item>
  </channel>
</rss>

