Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reset committed offset of Spark Streaming to capture missed data

DatabricksUser5
Visitor

I have a very straightforward setup between Azure Event Hubs and DLT, using the Kafka endpoint through Spark Structured Streaming.

There were network issues and the stream didn't pick up some events, but it still progressed (and committed) the offset for some reason.

As such, the DLT pipeline now picks up any new data coming into the Event Hub, but not the events that arrived before the network issue was resolved.

Is there a way to force-reset the offset of the Spark reader so it always starts from earliest? At the moment, setting the desired starting offset does not work, because there is already a committed offset that takes precedence, but I want to override that.
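
For context, this is roughly the setup, as a minimal sketch with placeholder namespace, topic, and secret names. Note that the Kafka source only honours `startingOffsets` when a query starts without a checkpoint; once offsets are in the checkpoint, a restart resumes from there, which is why setting the option alone doesn't change anything.

```python
import dlt
from pyspark.sql.functions import col

# All names below (namespace, topic, secret scope/key) are placeholders.
EH_NAMESPACE = "my-eventhubs-namespace"
EH_TOPIC = "my-eventhub"  # the Event Hub name acts as the Kafka topic
BOOTSTRAP = f"{EH_NAMESPACE}.servicebus.windows.net:9093"
CONNECTION_STRING = dbutils.secrets.get("my-scope", "eh-connection-string")

EH_SASL = (
    "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
    f'username="$ConnectionString" password="{CONNECTION_STRING}";'
)

@dlt.table(name="raw_events", comment="Events streamed from Event Hubs via the Kafka endpoint")
def raw_events():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP)
        .option("subscribe", EH_TOPIC)
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.sasl.jaas.config", EH_SASL)
        # Only applied when the query starts with a fresh checkpoint;
        # on restart, the stream resumes from the checkpointed offsets instead.
        .option("startingOffsets", "earliest")
        .load()
        .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
    )
```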


An alternative would be to create a new partition and move the events that were not picked up there, or to re-ingest the events that sit before the committed offset, but that's really not elegant imo.
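
The re-ingestion route could be done as a one-off batch read over the missed offset range, since the Kafka source also supports batch reads with explicit per-partition `startingOffsets`/`endingOffsets`. A minimal sketch (the offset numbers and target table name are placeholders, and it reuses the connection settings from the sketch above):

```python
# Hypothetical one-off backfill of the missed offset range.
# Reuses BOOTSTRAP, EH_TOPIC and EH_SASL from the streaming sketch above.
missed = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", BOOTSTRAP)
    .option("subscribe", EH_TOPIC)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config", EH_SASL)
    # Explicit per-partition offset range covering the gap (placeholder numbers);
    # -1 in endingOffsets means "latest".
    .option("startingOffsets", '{"my-eventhub": {"0": 12000, "1": 11800}}')
    .option("endingOffsets",   '{"my-eventhub": {"0": 12500, "1": 12300}}')
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Hypothetical target table for the backfilled events.
missed.write.mode("append").saveAsTable("catalog.schema.raw_events_backfill")
```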

0 REPLIES