Checkpoint is getting created even though the microbatch append has failed

_Orc
New Contributor

Use case

Read data from a source table with Spark Structured Streaming (running round the clock), apply transformation logic, and finally merge the DataFrame into a target table. If anything fails during transformation or merge, the Databricks job should not fail.

In my notebook I have the skeleton code below, and I run the notebook on a job cluster. Whenever handlemicrobatch fails, I catch the exception and send an alert, but I do not rethrow it, because we don't want the Databricks job to fail. In doing so, however, the checkpoint gets created even though the data was never written to the target table. Is there a way to avoid checkpoint creation in case of failure, without throwing the exception in the catch block?

import org.apache.spark.sql.DataFrame

class handlerclass(/* ...dependencies... */) extends Serializable {

  def handlemicrobatch(df: DataFrame, batchid: Long): Unit = {
    try {
      // apply transformation logic ... and finally merge into the target table
    } catch {
      case e: Exception =>
        // log errors
        // send alert emails
        // the exception is deliberately not rethrown, so the job keeps running
    }
  }
}

val classobj = new handlerclass(/* ... */)

val checkpoint = "dbfs:/checkpoint/<sourcetable>"

def startStream(): Unit = {
  // read a stream from the source table
  val df = spark.readStream.table("<sourcetable>")

  val streamingquery = df.writeStream
    .foreachBatch(classobj.handlemicrobatch _)
    .option("checkpointLocation", checkpoint)
    // ...
    .start()
}

ACCEPTED SOLUTION

jose_gonzalez
Moderator

Hi @Om Singh,

The checkpoint is what lets your stream recover from failures, so it is created whether or not your streaming job fails. It stores the offsets and other metadata your streaming query needs to work correctly. You can find more details on how checkpointing works and its benefits in the Structured Streaming programming guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview
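A bit more detail on the mechanics may help. With foreachBatch, Spark writes the planned offsets for a batch to the checkpoint before invoking your function, and writes a commit marker only after the function returns without throwing. Catching and swallowing the exception therefore makes the batch look committed, and it will never be replayed. Below is a minimal sketch of the usual pattern, reusing the question's handlemicrobatch skeleton (the alerting hook is an illustrative placeholder, not a real API): alert first, then rethrow.

import org.apache.spark.sql.DataFrame

class handlerclass(/* dependencies */) extends Serializable {
  def handlemicrobatch(df: DataFrame, batchid: Long): Unit = {
    try {
      // apply transformations and merge into the target table
    } catch {
      case e: Exception =>
        // log and alert before failing the batch
        println(s"microbatch $batchid failed: ${e.getMessage}")
        // sendAlertEmail(e)  // hypothetical alerting hook
        // rethrow: Spark will not write the commit marker for this batchid,
        // so the same batch is reprocessed when the query restarts
        throw e
    }
  }
}

The job itself can still be kept alive without swallowing the exception: configure retries on the Databricks task, or wrap the query in a restart loop in the notebook so a failed query is restarted instead of killing the job. If rethrowing is truly not an option, the remaining approach is to durably record the failed batchid (for example, to a status table) inside the catch block and reprocess those batches out of band, because the checkpoint alone will not replay a batch it considers committed.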


2 REPLIES


Anonymous
Not applicable

Hi @Om Singh,

Hope you are doing well. Just wanted to check in and see if you were able to find a solution to your question.

Cheers
