I have a streaming notebook that fetches messages from a Confluent Kafka topic and loads them into ADLS. It is a streaming notebook with the trigger set to continuous processing. Before loading each message (which is in Avro format), I flatten it out using some Python UDFs. While loading into ADLS, in addition to the Kafka timestamp, I also add one extra timestamp column, say LoadTs, populated with the current timestamp, so I can analyze the latency of the data load.
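For reference, here is a minimal sketch of what the notebook does (the broker, topic, paths, and the flatten_message UDF are placeholders for the real ones, and I've omitted the trigger configuration):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Placeholder standing in for the real flattening logic.
@F.udf(returnType=StringType())
def flatten_message(avro_bytes):
    # Stand-in for the real logic: decode the Confluent Avro payload
    # and flatten the nested fields.
    return str(avro_bytes)  # placeholder only

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<confluent-broker>:9092")
    .option("subscribe", "<topic>")
    .option("startingOffsets", "earliest")
    .load()
)

flattened = (
    raw
    # "timestamp" (the Kafka message timestamp) comes from the source as-is.
    .withColumn("flattened", flatten_message(F.col("value")))
    .withColumn("LoadTs", F.current_timestamp())  # extra column to measure load latency
)

query = (
    flattened.writeStream
    .format("parquet")  # assuming Parquet files in ADLS
    .option("path", "abfss://<container>@<account>.dfs.core.windows.net/<path>")
    .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/<checkpoint>")
    .start()
)
```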
The issue I'm facing is that some messages are getting written to ADLS multiple times.
For example, if I load 1,000 messages generated on 01-04-2023, then both the Kafka timestamp and the LoadTs will be 01-04-2023.
Out of these 1,000 messages, a seemingly random subset (100, 20, etc.) gets written into ADLS again on 10-04-2023. The Kafka timestamp for the duplicated messages is still 01-04-2023, so only the LoadTs differs.
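This is roughly how I'm spotting the duplicates, in case it helps (it assumes the Kafka topic, partition, and offset columns were carried through to the output, as in the sketch above):

```python
from pyspark.sql import functions as F

# Read back what has landed in ADLS so far (same path as the stream's sink).
loaded = spark.read.parquet(
    "abfss://<container>@<account>.dfs.core.windows.net/<path>"
)

# A Kafka message is uniquely identified by (topic, partition, offset),
# so any group with more than one row is a duplicate write.
dupes = (
    loaded
    .groupBy("topic", "partition", "offset")
    .agg(
        F.count("*").alias("copies"),
        F.min("LoadTs").alias("first_load"),
        F.max("LoadTs").alias("last_load"),
    )
    .filter(F.col("copies") > 1)
)

dupes.show(truncate=False)
```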
I'm assuming this issue is caused by checkpointing, e.g. the stream being restarted and replaying offsets that were already written to ADLS but not yet committed to the checkpoint, which would make the sink effectively at-least-once.
Did anyone else face the same issue? Or does anyone have any suggestions/ideas on how to avoid it?
Uma Mahesh D