Data Engineering
Checkpoint issue when loading data from confluent kafka

UmaMahesh1
Honored Contributor III

I have a streaming notebook that fetches messages from a Confluent Kafka topic and loads them into ADLS, using continuous processing as the trigger. Before loading each message (which is in Avro format), I flatten it using some Python UDFs. While loading into ADLS, in addition to the Kafka timestamp, I also add one extra timestamp column, say LoadTs, populated with the current timestamp, to analyze the latency of the data load.

The issue I'm facing is that some messages are getting written to ADLS multiple times.

For example, if I loaded 1,000 messages generated on 01-04-2023, then both the Kafka timestamp and the LoadTs will be 01-04-2023.

Out of these 1,000 messages, some random number (100, 20, etc.) are written into ADLS again on 10-04-2023. The Kafka timestamp for the duplicated messages is still 01-04-2023.

I'm assuming this is caused by the checkpoint mechanism.

Has anyone else faced the same issue? Or does anyone have suggestions/ideas on how to avoid it?
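Since a Kafka record is uniquely identified by its (topic, partition, offset) triple, one common workaround is to deduplicate downstream on that key, so re-delivered records after a checkpoint recovery are dropped. A minimal sketch of the idea in plain Python (the record shape and field names here are hypothetical, not from the original pipeline):

```python
# Hypothetical sketch: drop re-delivered Kafka records by keeping only
# the first row seen per (topic, partition, offset) key.

def dedupe_records(records):
    """Return records with duplicate (topic, partition, offset) keys removed."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["topic"], rec["partition"], rec["offset"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"topic": "t", "partition": 0, "offset": 1, "value": "a"},
    {"topic": "t", "partition": 0, "offset": 2, "value": "b"},
    {"topic": "t", "partition": 0, "offset": 1, "value": "a"},  # re-delivered
]
print(len(dedupe_records(rows)))  # 2
```

In Structured Streaming itself, the analogous approach is `dropDuplicates` on the Kafka key columns with a watermark, though that only bounds duplicates within the watermark window.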

1 REPLY

Avinash_94
New Contributor III

The best approach is not to depend on Kafka's commit mechanism. Instead, store the processing result and the message offset in an external data store within the same database transaction. If the transaction fails, both the commit and the processing fail and are redone; otherwise, both succeed. The table containing the offset information can have the schema {topic_name, partition_id, offset}.
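The transactional pattern described above can be sketched with SQLite standing in for the external store (table names and the message shape are made up for the example; any database with transactions and upserts works the same way):

```python
import sqlite3

# Hypothetical sketch: commit the processing result and the consumed
# offset in ONE transaction, so a failure rolls back both and a
# re-delivered message becomes a no-op.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE processed_events ('
    ' topic TEXT, partition_id INTEGER, "offset" INTEGER, payload TEXT,'
    ' PRIMARY KEY (topic, partition_id, "offset"))'
)
conn.execute(
    'CREATE TABLE kafka_offsets ('
    ' topic_name TEXT, partition_id INTEGER, "offset" INTEGER,'
    ' PRIMARY KEY (topic_name, partition_id))'
)

def process_message(conn, topic, partition, offset, payload):
    """Atomically store the result and advance the stored offset."""
    with conn:  # one transaction: both writes commit, or neither does
        # INSERT OR IGNORE makes a duplicate delivery a no-op
        conn.execute(
            "INSERT OR IGNORE INTO processed_events VALUES (?, ?, ?, ?)",
            (topic, partition, offset, payload),
        )
        conn.execute(
            "INSERT INTO kafka_offsets VALUES (?, ?, ?) "
            'ON CONFLICT(topic_name, partition_id) DO UPDATE SET "offset" = excluded."offset"',
            (topic, partition, offset),
        )

process_message(conn, "orders", 0, 41, "msg-a")
process_message(conn, "orders", 0, 42, "msg-b")
process_message(conn, "orders", 0, 42, "msg-b")  # duplicate delivery

count = conn.execute("SELECT COUNT(*) FROM processed_events").fetchone()[0]
print(count)  # 2
```

In a Spark job the same idea is usually applied inside `foreachBatch`, writing the batch output and the batch's offsets in one transaction against the target store.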
