Use case
Read data from a source table using Spark Structured Streaming (running round the clock), apply transformation logic, etc., and finally merge the DataFrame into the target table. If there is any failure during the transformation or the merge, the Databricks job should not fail.
In my notebook I have the skeleton code below (please ignore the exact syntax), and I am running the notebook on a job cluster. Whenever there is a failure in the handlemicrobatch function I catch it and send an alert, but I do not rethrow the error because we don't want the Databricks job to fail. However, by doing so the checkpoint still gets written (the batch's offsets are committed) even though the data was never processed/written to the target table. Is there a way to avoid committing the checkpoint for a failed batch without throwing the exception in the catch block?
class handlerclass(/* ... */) {
  def handlemicrobatch(df: DataFrame, batchid: Long): Unit = {
    try {
      // apply transformation logic ... and finally MERGE into the target table
    } catch {
      case e: Exception =>
        // log the error
        // send alert emails
        // NOTE: the exception is NOT rethrown, so the job keeps running
    }
  }
}
val classobj = new handlerclass(/* ... */)
val checkpoint = "dbfs:/checkpoint/<sourcetable>"

def startStream(): Unit = {
  val df = spark.readStream.table("<sourcetable>")
  val streamingquery = df.writeStream
    .foreachBatch(classobj.handlemicrobatch _)
    .option("checkpointLocation", checkpoint)
    // ... other options
    .start()
}
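For reference, the only workaround I can think of so far is to rethrow from handlemicrobatch (so the failed batch's offsets are never committed) and keep the job alive by catching the query failure on the driver and restarting the stream. This is just a rough sketch with placeholder names and a hard-coded retry delay, not what we currently run:

import org.apache.spark.sql.streaming.StreamingQueryException

def startStreamWithRestart(): Unit = {
  var keepRunning = true
  while (keepRunning) {
    try {
      val df = spark.readStream.table("<sourcetable>")
      val query = df.writeStream
        .foreachBatch(classobj.handlemicrobatch _)   // handler rethrows on failure in this variant
        .option("checkpointLocation", checkpoint)
        .start()
      query.awaitTermination()                       // throws if the query fails
      keepRunning = false                            // query stopped cleanly
    } catch {
      case e: StreamingQueryException =>
        // log / send alert here, then loop to restart from the last committed offsets;
        // the failed batch is retried because its offsets were never committed
        Thread.sleep(60000)
    }
  }
}

Is something like this the recommended pattern, or is there a way to prevent the checkpoint commit directly from inside foreachBatch without letting the exception propagate?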