03-02-2022 12:19 PM
Use case
Read data from a source table using Spark Structured Streaming (running round the clock), apply transformation logic, and finally merge the DataFrame into the target table. If there is any failure during the transformation or merge, the Databricks job should not fail.
In my notebook I have the skeleton code below (please ignore the syntax), and I am running the notebook on a job cluster. Whenever there is a failure in the handlemicrobatch function, I catch it and send an alert, but I do not rethrow the error because we don't want the Databricks job to fail. In doing so, however, the checkpoint still gets written even though the data was not processed/written to the target table. Is there a way to avoid the checkpoint commit on failure without throwing the exception from the catch block?
class handlerclass(.......) {
  def handlemicrobatch(df: DataFrame, batchid: Long): Unit = {
    try {
      // apply transformation logic ... and finally merge into the target table
    } catch {
      case e: Exception =>
        // log errors
        // send alert emails
        // not rethrowing here, so the stream treats the batch as successful
    }
  }
}
val classobj = new handlerclass(......)
val checkpoint = "dbfs:/checkpoint/<sourcetable>" // checkpoint location for this stream

def startStream(): Unit = {
  val df = spark.readStream.table("<sourcetable>") // stream from the source table
  val streamingquery = df.writeStream
    .foreachBatch(classobj.handlemicrobatch _)
    .option("checkpointLocation", checkpoint)
    // ............
    .start()
}
03-07-2022 11:19 AM
Hi @Om Singh ,
The checkpoint is what lets you recover from failures, so it will be created whether your streaming job fails or not. It stores the offsets and other metadata that your streaming job needs in order to work correctly. You can find more details on how the checkpoint works and the benefits of using it here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview
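The behaviour in the question follows from the foreachBatch contract: Spark commits a micro-batch's offsets to the checkpoint only after the batch function returns normally, so swallowing the exception inside the function makes the batch look successful. Here is a minimal stand-in sketch of that contract in plain Scala (no Spark dependency; `CheckpointSketch`, `commitLog`, and `runBatch` are illustrative names, not Spark APIs):

```scala
import scala.collection.mutable

object CheckpointSketch {
  // Plays the role of the checkpoint's commit log: ids of committed batches.
  val commitLog = mutable.ListBuffer.empty[Long]

  // Models the engine side of foreachBatch: the batch is committed
  // if and only if the user's handler returns normally.
  // Returns true when the batch was committed.
  def runBatch(batchId: Long)(handler: Long => Unit): Boolean =
    try {
      handler(batchId)
      commitLog += batchId // handler returned normally -> offsets committed
      true
    } catch {
      case e: Exception =>
        // The stream would stop here; a job retry restarts it
        // from the last committed batch.
        println(s"batch $batchId failed: ${e.getMessage}")
        false
    }
}
```

A handler that catches its own exception and returns normally (as in the question) still gets committed; a handler that alerts and then rethrows leaves the batch uncommitted, so it is reprocessed on restart. In other words, the only way to keep a failed batch out of the checkpoint is to let the exception propagate (send the alert first, then rethrow) and rely on job retries to resume the stream from the last committed batch.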
04-12-2022 09:34 AM
Hi @Om Singh,
Hope you are doing well. Just wanted to check in and see if you were able to find a solution to your question?
Cheers