Streaming with runOnce and groupBy window queries
11-28-2021 09:04 AM
I have a streaming job running a groupBy query with a Window of 3 days. The query is searching for different types of events.
The stream is configured with runOnce and there is a job scheduled for every hour.
Now, I'm not sure what data is processed each time the stream is triggered. Suppose there is only one new event, and other events relevant to the same window timeframe were already processed in a previous run. Is it going to look only at the new data, or at all the relevant data for the query in the window?
Labels: Groupby Window Queries
11-28-2021 09:16 AM
In my opinion, a groupBy on a stream with the run-once trigger will read only the new data, because it resumes from the offsets stored in the checkpoint; the old input data will not be read again.
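To make the checkpoint point concrete, here is a toy model in plain Python (not Spark code). It assumes a 3-day tumbling window aligned to the Unix epoch and a count per event type; the names `run_once`, `window_start`, and the checkpoint layout are hypothetical and only for illustration. It also relies on the fact that Structured Streaming saves aggregation state together with the offsets in the checkpoint, so a run that reads a single new event can still update a count that includes events from earlier runs.

```python
import json
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(days=3)   # assumed tumbling window size
EPOCH = datetime(1970, 1, 1)  # assumed window alignment


def window_start(ts):
    """Floor a timestamp to the start of its 3-day tumbling window."""
    return EPOCH + ((ts - EPOCH) // WINDOW) * WINDOW


def run_once(events, checkpoint):
    """Toy model of one triggered run.

    Reads only the events past the stored offset, merges them into the
    aggregation state saved by the previous run, and returns the number
    of rows read plus the new checkpoint.
    """
    offset = checkpoint.get("offset", 0)
    state = Counter(checkpoint.get("state", {}))
    new_events = events[offset:]  # old events are not read again
    for ts, event_type in new_events:
        key = f"{window_start(ts).isoformat()}|{event_type}"
        state[key] += 1
    # Round-trip through JSON to mimic persisting the checkpoint to storage.
    saved = json.loads(json.dumps({"offset": len(events), "state": dict(state)}))
    return len(new_events), saved


if __name__ == "__main__":
    events = [
        (datetime(2021, 11, 28, 8, 0), "login"),
        (datetime(2021, 11, 28, 8, 30), "login"),
    ]
    rows_read, cp = run_once(events, {})
    print(rows_read, cp["state"])

    # One hour later a single new event arrives in the same window.
    events.append((datetime(2021, 11, 28, 9, 4), "login"))
    rows_read, cp = run_once(events, cp)
    print(rows_read, cp["state"])
```

In this model the second run reads one row, yet the count for the window reflects all three events, because the earlier counts come from the saved state rather than from re-reading the input.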
11-29-2021 10:54 AM
Hi @itay k,
Take a look at the Progress Reporter, which logs the metrics for each micro-batch as JSON. For example, the "numInputRows" metric shows the number of input rows that were processed in the micro-batch. You will find these metrics in the driver logs --> log4j
In addition, the following article explains what these streaming metrics mean and how to access and view them: https://databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0...
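As a rough sketch of how those metrics could be checked, the snippet below parses one progress JSON entry and pulls out "numInputRows" along with the state-store row counts. The sample values are made up for illustration and the helper name `summarize_progress` is hypothetical; the field names (`batchId`, `numInputRows`, `stateOperators`, `numRowsTotal`, `numRowsUpdated`) are the ones the progress JSON uses. In PySpark the same data should also be available as a dict from `query.lastProgress`.

```python
import json

# Illustrative sample only: a trimmed-down progress entry with made-up values.
SAMPLE_PROGRESS = """
{
  "batchId": 12,
  "numInputRows": 1,
  "stateOperators": [
    {"numRowsTotal": 40, "numRowsUpdated": 1}
  ]
}
"""


def summarize_progress(progress_json):
    """Extract input-row and state-row counts from one progress entry."""
    progress = json.loads(progress_json)
    operators = progress.get("stateOperators", [])
    return {
        "batchId": progress["batchId"],
        # Rows read from the source in this micro-batch.
        "numInputRows": progress["numInputRows"],
        # Rows currently held in aggregation state across all operators.
        "stateRowsTotal": sum(op.get("numRowsTotal", 0) for op in operators),
        # State rows touched by this micro-batch.
        "stateRowsUpdated": sum(op.get("numRowsUpdated", 0) for op in operators),
    }


if __name__ == "__main__":
    print(summarize_progress(SAMPLE_PROGRESS))
```

A small "numInputRows" next to a larger total of state rows would suggest that the run read only the new events while the windowed aggregates were carried over from earlier runs.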

