I have a streaming notebook that fetches messages from a Confluent Kafka topic and loads them into ADLS. It is a streaming notebook with the trigger set to continuous processing. Before loading each message (which is in Avro format), I flatten it out using some Python UDFs. While loading into ADLS, in addition to the Kafka timestamp, I also add one extra timestamp column, say LoadTs, populated with the current timestamp, so I can analyze the latency of the data load.
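For reference, here is a minimal sketch of what the notebook does (the broker, topic, paths, and the flatten_message UDF are placeholders for the real ones, and I've omitted the trigger configuration):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Placeholder standing in for the real flattening logic.
@F.udf(returnType=StringType())
def flatten_message(avro_bytes):
    # Stand-in for the real logic: decode the Confluent Avro payload
    # and flatten the nested fields.
    return str(avro_bytes)  # placeholder only

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<confluent-broker>:9092")
    .option("subscribe", "<topic>")
    .option("startingOffsets", "earliest")
    .load()
)

flattened = (
    raw
    # "timestamp" (the Kafka message timestamp) comes from the source as-is.
    .withColumn("flattened", flatten_message(F.col("value")))
    .withColumn("LoadTs", F.current_timestamp())  # extra column to measure load latency
)

query = (
    flattened.writeStream
    .format("parquet")  # assuming Parquet files in ADLS
    .option("path", "abfss://<container>@<account>.dfs.core.windows.net/<path>")
    .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/<checkpoint>")
    .start()
)
```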
The issue I'm facing is that some messages are getting written to ADLS multiple times.
For example, if I load 1,000 messages generated on 01-04-2023, then both the Kafka timestamp and the LoadTs will be 01-04-2023.
Out of these 1,000 messages, a seemingly random subset (100, 20, etc.) gets written into ADLS again on 10-04-2023. The Kafka timestamp for the duplicated messages is still 01-04-2023, so only the LoadTs differs.
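This is roughly how I'm spotting the duplicates, in case it helps (it assumes the Kafka topic, partition, and offset columns were carried through to the output, as in the sketch above):

```python
from pyspark.sql import functions as F

# Read back what has landed in ADLS so far (same path as the stream's sink).
loaded = spark.read.parquet(
    "abfss://<container>@<account>.dfs.core.windows.net/<path>"
)

# A Kafka message is uniquely identified by (topic, partition, offset),
# so any group with more than one row is a duplicate write.
dupes = (
    loaded
    .groupBy("topic", "partition", "offset")
    .agg(
        F.count("*").alias("copies"),
        F.min("LoadTs").alias("first_load"),
        F.max("LoadTs").alias("last_load"),
    )
    .filter(F.col("copies") > 1)
)

dupes.show(truncate=False)
```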
I'm assuming this issue is caused by checkpointing, e.g. the stream being restarted and replaying offsets that were already written to ADLS but not yet committed to the checkpoint, which would make the sink effectively at-least-once.
Did anyone else face the same issue? Or does anyone have any suggestions/ideas on how to avoid it?
Uma Mahesh D