cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Checkpoint is getting created even the though the microbatch append has failed

_Orc
New Contributor

Use case

Read data from source table using structured spark streaming(Round the clock).Apply transformation logic etc etc and finally merge the dataframe in the target table.If there is any failure during transformation or merge ,databricks job should not fail.

In my notebook i have the below skeleton code(Please ignore the syntax) and i am running the notebook in a job cluster. So whenever there is failure in handlemicrobatch function i am catching it and sending an alert but i am not throwing the error because we don't want to fail the databricks job.But in doing so the checkpoint gets created even though the data was not processed/written to the target table. Is there a way to avoid checkpoint creation in case of failure without throwing the exception in catch block?

class handlerclass(.......){

def handlemicrobatch(df: DataFrame ,batchid: Long){

try{

apply transformation logics...and finally merge to target

}catch { case e: Exception {

log errors

send emails

}

}

}

}

val classobj = new handlerclass(......);

val checkpoint = dbfs:/checkpoint/<sourcetable>

StartStream(){

val df = readstream ..on source table

val streamingquery = df.writestream

.foreachbatch(classobj.handlemicrobatch _).

.option("checkpointLocation",checkpoint)

............

.start()

}

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Moderator
Moderator

Hi @Om Singh​ ,

Your checkpoint will help you to recover from failures, so your checkpoint will be created if your streaming job fails or not. Your checkpoint will have the offsets and many other metrics that are needed for your streaming job to be able to work correctly. You can find more details on how the checkpoint works and what are the benefits of using it, https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview

View solution in original post

2 REPLIES 2

jose_gonzalez
Moderator
Moderator

Hi @Om Singh​ ,

Your checkpoint will help you to recover from failures, so your checkpoint will be created if your streaming job fails or not. Your checkpoint will have the offsets and many other metrics that are needed for your streaming job to be able to work correctly. You can find more details on how the checkpoint works and what are the benefits of using it, https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview

Anonymous
Not applicable

Hi @Om Singh​ 

Hope you are doing well. Just wanted to check in and see if you were able to find a solution to your question?

Cheers

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group