
Can I reset the checkpoint of a streaming job if I want to do a full reload of a table?

User16826992666
Valued Contributor
 
Accepted Solutions

sajith_appukutt
Honored Contributor II

If the read stream definition has something similar to

val df = spark
  .readStream  // streaming source; a batch .read keeps no checkpoint
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .load()

resetting the checkpoint (deleting it, or pointing the query at a new checkpoint location) makes the query start again from the earliest record still available in each matching topic. Whether that amounts to a full reload of the table depends on the topic's retention settings, such as retention.ms. Records that have already expired from Kafka won't be reprocessed.
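As a rough sketch of what such a reset could look like, the snippet below restarts the stream against a fresh checkpoint directory so that startingOffsets is honored again. The object name, the checkpointPathFor helper, the base path /tmp/checkpoints/events, the Delta sink, and the table name events_bronze are all assumptions made for illustration; none of them come from the original question. It also assumes Spark 3.1 or later (for toTable) with the Kafka and Delta connectors available.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

object FullReloadSketch {
  // Hypothetical helper: build a distinct checkpoint directory per reload,
  // e.g. ("/tmp/checkpoints/events", "reload-2") => "/tmp/checkpoints/events/reload-2".
  def checkpointPathFor(base: String, runId: String): String =
    s"${base.stripSuffix("/")}/$runId"

  // Starts the Kafka stream with a checkpoint location that has no saved
  // offsets, so the query falls back to startingOffsets = earliest.
  def startFullReload(spark: SparkSession, runId: String): StreamingQuery = {
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribePattern", "topic.*")
      .option("startingOffsets", "earliest")
      .load()

    df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .writeStream
      .format("delta") // assumed sink format
      .option("checkpointLocation",
        checkpointPathFor("/tmp/checkpoints/events", runId)) // assumed base path
      .toTable("events_bronze") // assumed target table name
  }
}
```

Using a new directory per reload, rather than deleting the old one, keeps the previous checkpoint around in case the reload has to be abandoned. Note that replaying from the earliest offset into a table that already holds the old rows would likely produce duplicates, so the target table usually needs to be emptied or replaced first.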

