cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Checkpoint issue when loading data from confluent kafka

UmaMahesh1
Honored Contributor III

I have a streaming notebook which fetches messages from confluent Kafka topic and loads them into adls. It is a streaming notebook with the trigger as continuous processing. Before loading the message (which is in Avro format), I'm flattening out the message using some python udf's. While loading into adls, in addition to the Kafka timestamp, I also added one extra timestamp, say LoadTs column using current timestamp to analyze the latency of data load.

The issue I'm facing is that some messages are getting duplicated in adls multiple times.

For example, if I loaded 1000 messages generated on 01-04-2023, then the kafka timestamp will be 01-04-2023 and also the LoadTs will be 01-04-2023.

Out of these 1000 messages, some random number of messages (100, 20, etc..) are again getting written into adls on 10-04-2023. The kafka timestamp for the messages getting duplicated is still 01-04-2023.

I'm assuming that this issue is because of the checkpoint feature.

Did anyone else face the same issue. Or does anyone have any suggestions/ideas on how to avoid that ?

Uma Mahesh D
1 REPLY 1

Avinash_94
New Contributor III

Best approach is to not to depend on Kafka’s commit mechanism! We can store processing result and message offset to external data store in the same database transaction. So, if the database transaction fails, both commit and processing will fail and will be redone again. Otherwise, both will succeed. The table containing offset information can have schema {topic_name, partition_id, offset}.  

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group