Data Engineering

AvailableNow Trigger and failure

Maatari
New Contributor III

Hi, 

What is the expected behavior of Spark Structured Streaming when using the AvailableNow trigger and the query fails partway through? More specifically, what happens to the end offset that was set initially? Does it change? It is clear that with checkpointing the query resumes where it left off, but what happens to the end offset? To some degree this amounts to asking whether Spark Structured Streaming distinguishes between a failure and the end of the query.

3 REPLIES

Walter_C
Databricks Employee

When using the AvailableNow trigger in Spark Structured Streaming, the behavior during a query failure is as follows:

  1. End Offset: The initial end offset set by the AvailableNow trigger does not change due to a query failure. The AvailableNow trigger processes all available data up to a specific point in time, and this end offset remains fixed even if the query fails.

  2. Query Resumption: If checkpointing is enabled, the query will resume from where it left off upon recovery. This means that the processing will continue from the last successfully processed offset, not from the beginning. The end offset remains the same as initially set by the AvailableNow trigger.

  3. Failure vs. End of Query: Spark Structured Streaming does differentiate between a query failure and the end of the query. A failure means the query did not complete successfully, and upon recovery, it will continue processing from the last checkpoint. The end of the query, in the context of AvailableNow, means that all data up to the specified end offset has been processed.

In summary, the end offset set by the AvailableNow trigger remains unchanged during a query failure, and the query will resume from the last checkpointed position upon recovery.
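The three points above can be illustrated with a toy model in plain Python. This is only a sketch of the described behavior, not Spark code: `run_available_now`, `SimulatedFailure`, and the dict-based `checkpoint` are hypothetical names, and the end offset is simply passed in by the caller rather than tracked the way Spark tracks it.

```python
class SimulatedFailure(Exception):
    """Stands in for any error that kills the query mid-run."""


def run_available_now(records, checkpoint, end_offset, sink,
                      batch_size=2, fail_at_offset=None):
    """Process records[committed:end_offset] in micro-batches.

    Toy model only: `checkpoint` is a plain dict, not Spark's checkpoint format.
    """
    start = checkpoint["committed"]
    while start < end_offset:
        stop = min(start + batch_size, end_offset)
        if fail_at_offset is not None and start <= fail_at_offset < stop:
            # The batch dies before anything is written or committed.
            raise SimulatedFailure(f"failed in batch [{start}, {stop})")
        sink.extend(records[start:stop])
        checkpoint["committed"] = stop  # progress is recorded only after success
        start = stop


records = list(range(10))
checkpoint = {"committed": 0}
sink = []
end_offset = 6  # assumed fixed when the run was planned; records 6..9 arrived later

# First attempt: fails while processing the batch that contains offset 3.
try:
    run_available_now(records, checkpoint, end_offset, sink, fail_at_offset=3)
except SimulatedFailure:
    pass
after_failure = (list(sink), checkpoint["committed"])

# Recovery: resumes from the committed position and stops at the same end offset.
run_available_now(records, checkpoint, end_offset, sink)
```

After the failed attempt only the first batch has been committed; the second call picks up from there and finishes at offset 6 without reprocessing or skipping records. In this model, "failure" is the state where the committed position is still behind the end offset, and "end of query" is the state where the two are equal.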

Maatari
New Contributor III

Thank you so much, this is a really helpful answer.

If I may, I would like to understand the mechanics under the hood a bit further. Would it be possible to share the classes involved? How is the AvailableNow trigger able to set up a context such that, when a query starts, it is known that the end offset was not reached and therefore we are probably in a failure scenario, versus the end offset was consumed, so this is a new run and a new end offset can be fetched? The interplay might come from somewhere else, I don't know, but I am keen on learning more and getting a sense of where to look for these things.

Walter_C
Databricks Employee

The AvailableNow trigger processes all data that is available when the query starts, possibly across several micro-batches, and then stops. This differs from the default micro-batch trigger and the continuous trigger, where the query keeps running and keeps checking for new data. When a query starts with the AvailableNow trigger, it determines whether the end offset (the point up to which data was to be processed) was actually reached. If the end offset was not reached, this indicates a failure scenario, and the system reprocesses the data starting from the last known successful offset. If the end offset was consumed, this is a new run, and the system fetches a new end offset to process the next set of data.
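One way to picture the check described above is with two logs kept in the checkpoint: one recording the offsets each batch was planned to cover, and one recording which batches finished. A Spark checkpoint directory does contain `offsets` and `commits` subdirectories, and in open-source Spark the classes to look at are most likely `MicroBatchExecution`, `OffsetSeqLog`, and `CommitLog`, but treat that as a pointer to verify against your Spark version. The code below is a simplified sketch in plain Python; `plan_next_batch` and the dict/set log representations are hypothetical, not Spark's API or file format.

```python
def plan_next_batch(offset_log, commit_log, latest_available):
    """Decide what to run when the query (re)starts.

    offset_log: dict of batch_id -> planned end offset (written before a batch runs)
    commit_log: set of batch_ids that completed successfully
    latest_available: newest offset currently available at the source

    Returns (batch_id, end_offset, is_rerun), or None if there is nothing to do.
    """
    if offset_log:
        last_planned = max(offset_log)
        if last_planned not in commit_log:
            # Planned but never committed: the previous run failed mid-batch,
            # so rerun the same batch with the end offset recorded earlier.
            return last_planned, offset_log[last_planned], True
        start = offset_log[last_planned]
        next_id = last_planned + 1
    else:
        start, next_id = 0, 0

    if latest_available > start:
        # Previous work is fully committed: plan a new batch with a fresh end offset.
        offset_log[next_id] = latest_available
        return next_id, latest_available, False
    return None


# Failure scenario: batch 0 was planned up to offset 5 but never committed.
failed = plan_next_batch({0: 5}, set(), latest_available=9)

# New run: batch 0 committed, new data has arrived, so a new end offset is taken.
log = {0: 5}
fresh = plan_next_batch(log, {0}, latest_available=9)

# New run with no new data: nothing to process.
idle = plan_next_batch({0: 5}, {0}, latest_available=5)
```

The design point is that the plan is written before the work and the commit after it, so a planned batch with no matching commit is what marks a failed run, while matching entries mark a clean finish.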
