Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Streaming Kafka data without duplication

mddheeraj
New Contributor

Hello,

We are building an application that reads data from a Kafka topic sent by a source. After we receive the data, we apply some transformations and write the results to another Kafka topic. In this process, the source may send the same data twice.

Our questions are:

1. How can we prevent duplicates and send only updated data to the target Kafka topic?

2. Where, and in what format, should we store the data in Databricks to check for duplicates?

Thank You,

Dheeraj

1 REPLY

Kaniz_Fatma
Community Manager

Hi @mddheeraj,

  1. To control duplicates and ensure only updated data is sent to the target Kafka topic, enable idempotence in your Kafka producer so that retried sends of the same message are not written more than once. This is done by setting enable.idempotence=true in the producer configuration.
  2. You can also implement deduplication logic on the consumer side. Assign a unique identifier (e.g., a UUID) to each message and store these identifiers in persistent storage, such as a database. Before processing a message, check whether its identifier already exists in the store: if it does, skip it; otherwise, process the message and record the identifier.
  3. You can also use Kafka Streams to process and de-duplicate messages. Kafka Streams can maintain state stores to keep track of processed messages and ensure that duplicates are filtered out before they reach the target topic.
  4. If you store data in Databricks to check for duplicates, consider using Delta Lake, an open-source storage layer that brings ACID transactions to Apache Spark and big-data workloads. It helps manage duplicates through features like upserts (MERGE) and time travel: store your data in Delta format and use these capabilities to detect and handle duplicates.
  5. More generally, store the data in a structured, columnar format such as Parquet or Delta rather than raw files, so that duplicate checks are efficient.
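As a sketch of item 4, assuming the incoming batch is staged into a table named `updates` with a business key `event_id` and a timestamp `updated_at` (hypothetical names for illustration), a Delta Lake MERGE can upsert only new or changed rows:

```sql
MERGE INTO target AS t
USING updates AS u
  ON t.event_id = u.event_id
-- Only overwrite when the incoming row is actually newer,
-- so exact re-sends of old data are ignored.
WHEN MATCHED AND u.updated_at > t.updated_at THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
```

Because the MERGE is transactional, re-running it with the same input is safe: duplicate rows match on `event_id` and fail the `updated_at` condition, so they are simply dropped.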
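For item 1, here is a minimal producer-configuration sketch in Python. The keys follow the confluent-kafka/librdkafka naming convention, and the broker address is a placeholder, not a real endpoint:

```python
# Producer settings for idempotent delivery: the broker de-duplicates
# retried sends of the same batch, so transient retries cannot create
# duplicates on the target topic.
producer_config = {
    "bootstrap.servers": "broker:9092",  # placeholder broker address
    "enable.idempotence": True,          # broker rejects duplicate retries
    "acks": "all",                       # required when idempotence is on
}

# In a real environment you would pass this to the producer, e.g.:
# from confluent_kafka import Producer
# producer = Producer(producer_config)
```

Note that idempotence only protects against producer-side retries; it does not help if the upstream source genuinely sends the same record twice, which is where items 2-4 come in.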
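Item 2 can be sketched in plain Python. Here SQLite stands in for the persistent identifier store; in production this might be a database or a Delta table, and the message ID would come from the Kafka record:

```python
import sqlite3

def make_store(path=":memory:"):
    # Store of message IDs already processed. ":memory:" is for
    # illustration only; use a durable path or database in production.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")
    return conn

def process_once(conn, message_id, payload, handler):
    """Run handler(payload) only if message_id has not been seen before."""
    try:
        # The PRIMARY KEY constraint makes the duplicate check atomic:
        # a second insert of the same id raises IntegrityError.
        conn.execute("INSERT INTO seen (id) VALUES (?)", (message_id,))
        conn.commit()
    except sqlite3.IntegrityError:
        return False  # duplicate: skip processing
    handler(payload)
    return True
```

Processing the same message ID twice invokes the handler only once; the second call returns False and is skipped.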

If you have any more questions or need further assistance, feel free to ask! 
