Streaming with runOnce and groupBy window queries

itay
New Contributor II

I have a streaming job running a groupBy query with a window of 3 days. The query looks for different types of events.

The stream is configured with runOnce, and a job is scheduled to trigger it every hour.

I'm not sure which data is processed each time the stream is triggered. Suppose there is only one new event, and other events that fall within the same window timeframe were already processed in a previous run. Will the query look only at the new data, or at all the data in the window that is relevant to the query?

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

In my opinion, a groupBy with a run-once trigger will process only the new data, because the stream resumes from the offset stored in the checkpoint and the old data will not be available to it.
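To make the offset argument concrete, here is a minimal toy model in plain Python. It is not Spark code: `ToyCheckpoint`, `run_once`, and `window_start` are hypothetical names invented for this sketch. It assumes the checkpoint keeps both the last read offset and the running per-window aggregates, which is how Structured Streaming stateful aggregations are generally understood to behave, so each run reads only the new events but merges them into the counts already stored for the window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(days=3)
EPOCH = datetime(1970, 1, 1)


def window_start(ts):
    """Start of the tumbling 3-day window that contains ts."""
    n = (ts - EPOCH) // WINDOW  # whole windows elapsed since the epoch
    return EPOCH + n * WINDOW


class ToyCheckpoint:
    """Hypothetical stand-in for a checkpoint: an offset plus aggregation state."""

    def __init__(self):
        self.offset = 0                # how many source events were already read
        self.state = defaultdict(int)  # (window start, event type) -> count


def run_once(source, checkpoint):
    """Process only events past the stored offset, updating the stored counts."""
    new_events = source[checkpoint.offset:]
    for ts, event_type in new_events:
        checkpoint.state[(window_start(ts), event_type)] += 1
    checkpoint.offset = len(source)
    return len(new_events)


# Two events arrive before the first hourly run.
source = [
    (datetime(2022, 1, 1, 10), "login"),
    (datetime(2022, 1, 1, 11), "login"),
]
cp = ToyCheckpoint()
first = run_once(source, cp)

# One more event arrives before the second run.
source.append((datetime(2022, 1, 1, 12), "login"))
second = run_once(source, cp)

key = (window_start(datetime(2022, 1, 1, 12)), "login")
print(first, second, cp.state[key])
```

In this model the second run reads a single row, yet the count for the window reflects all three events, because the earlier two are carried in the stored state rather than re-read from the source.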

jose_gonzalez
Moderator

Hi @itay k,

You will need to take a look at the Progress Reporter, which shows the micro-batch metrics as JSON. For example, the "numInputRows" metric gives the number of input rows processed in the micro-batch. You will find these metrics in the driver logs, under log4j.

In addition, the following article explains what these streaming metrics mean and how to access and view them: https://databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0...
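As a rough sketch of how these metrics could be inspected from code instead of the logs: the helper below, `summarize_progress` (a hypothetical name), pulls a few fields out of a progress record. It assumes the record is dict-like, as `query.lastProgress` is in PySpark, and that it carries the `batchId`, `numInputRows`, and `stateOperators` fields. The `sample` payload is made up for illustration and is not real job output.

```python
def summarize_progress(progress):
    """Extract input-row and state-row counts from one micro-batch progress record."""
    state_ops = progress.get("stateOperators", [])
    return {
        "batchId": progress.get("batchId"),
        # Rows read from the source in this micro-batch
        "numInputRows": progress.get("numInputRows"),
        # Rows currently held in aggregation state, summed over stateful operators
        "stateRowsTotal": sum(op.get("numRowsTotal", 0) for op in state_ops),
        # State rows touched by this micro-batch
        "stateRowsUpdated": sum(op.get("numRowsUpdated", 0) for op in state_ops),
    }


# Hypothetical payload: values are invented, only the field names matter here.
sample = {
    "batchId": 5,
    "numInputRows": 1,
    "stateOperators": [{"numRowsTotal": 4, "numRowsUpdated": 1}],
}
summary = summarize_progress(sample)
print(summary)
```

With a real job, something like `summarize_progress(query.lastProgress)` after the run completes would be the intended usage, where `query` is the running streaming query handle. Comparing `numInputRows` with the state row counts is one way to see whether a run read only the new events while still updating existing window state.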
