Data Engineering

Streaming with runOnce and groupBy window queries

itay
New Contributor II

I have a streaming job running a groupBy query with a Window of 3 days. The query is searching for different types of events.

The stream is configured with runOnce, and a job is scheduled to run it every hour.

Now, I'm not sure what data is processed each time the stream is triggered. Suppose there is only one new event, and there are other relevant events within the window timeframe that were already processed in a previous run. Is it going to look only at the new data, or at all the data in the window that is relevant to the query?

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

In my opinion, a groupBy on a stream triggered once will cover only the new data, because it resumes from the offset stored in the checkpoint and the old data will not be available.

jose_gonzalez
Databricks Employee

Hi @itay k,

You will need to take a look at the Progress Reporter, which shows the micro-batch JSON metrics. For example, the metric called "numInputRows" displays the number of input rows that were processed in the micro-batch. You will find these metrics in the driver logs --> log4j
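To make the metrics above concrete, here is a minimal sketch of pulling "numInputRows" out of one progress report. It assumes the report has the usual Structured Streaming shape (a top-level "numInputRows", plus "sources" and "stateOperators" lists); the helper name summarize_progress and the sample values are made up for illustration and are not real log output.

```python
import json


def summarize_progress(progress):
    """Extract row counts from one micro-batch progress report.

    `progress` may be a JSON string (as printed in the driver logs)
    or an already-parsed dict.
    """
    if isinstance(progress, str):
        progress = json.loads(progress)
    return {
        "batchId": progress.get("batchId"),
        # Rows read from the sources in this micro-batch only.
        "numInputRows": progress.get("numInputRows", 0),
        # Per-source breakdown of the same count.
        "rowsPerSource": {
            s.get("description"): s.get("numInputRows", 0)
            for s in progress.get("sources", [])
        },
        # State rows updated by stateful operators such as a windowed groupBy.
        "stateRowsUpdated": sum(
            op.get("numRowsUpdated", 0)
            for op in progress.get("stateOperators", [])
        ),
    }


# Hypothetical sample report; the values are invented for this example.
sample = """
{
  "batchId": 7,
  "numInputRows": 1,
  "stateOperators": [{"numRowsTotal": 12, "numRowsUpdated": 1}],
  "sources": [{"description": "hypothetical-source", "numInputRows": 1}]
}
"""

summary = summarize_progress(sample)
print(summary)
```

In PySpark, the same report should also be reachable from the query handle through query.lastProgress once the run-once query has finished, so the helper could be called on that value instead of on text copied from the logs.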

In addition, the following article explains what these streaming metrics mean and how to access and view them: https://databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0...
