Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Duplicate data published in Kafka offsets

dipali_globant
New Contributor II

We have 25k records, which we publish in batches of 5k.

We number the records with a row_number window function and create the batches from that numbering.

We have observed that some records (around 10-20) are published as duplicates, at two different offsets.

Can someone help with the probable cause of this issue?

1 REPLY

agallardrivilla
New Contributor II

Hi @dipali_globant,

Duplicate data in Kafka can arise in a batch-processing scenario for a few reasons 🤔

Here's an example of ensuring unique and consistent row numbering:

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Replace "unique_column" with a unique, stable ordering column
window_spec = Window.orderBy("unique_column")

df = df.withColumn("row_number", row_number().over(window_spec))
The ROW_NUMBER window function may assign different row numbers across batch executions if there is no consistent ordering. This can result in the same records being included in multiple batches, especially if partitioning or ordering changes between runs. To avoid this, order by a column or combination of columns that is unique and stable (such as a timestamp or primary key).
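To make the batching itself deterministic, you can derive a batch id directly from the row number, so the batch ranges are disjoint by construction and no record can land in two batches as long as the row numbers are stable. A plain-Python sketch of the arithmetic (`batch_id` is a hypothetical helper; in Spark the same expression works with `floor((col("row_number") - 1) / 5000)`):

```python
BATCH_SIZE = 5000

def batch_id(row_number: int) -> int:
    # row_number is 1-based, as produced by row_number():
    # rows 1..5000 map to batch 0, 5001..10000 to batch 1, and so on.
    return (row_number - 1) // BATCH_SIZE

# Each row number falls in exactly one batch, so duplicates across
# batches can only come from unstable row numbering, not the split.
print(batch_id(1), batch_id(5000), batch_id(5001), batch_id(25000))
```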

If you are processing and publishing multiple batches concurrently, race conditions can also produce duplicates, especially for records that fall on the boundary between two consecutive batch windows.
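When exactly-once publishing cannot be guaranteed on the producer side, a common mitigation is to deduplicate on a stable record key before (or after) sending. A minimal sketch, assuming each record carries a unique key (`dedup_by_key` and `key_fn` are hypothetical names, not a Kafka API):

```python
def dedup_by_key(records, key_fn):
    """Drop records whose key has already been seen in this run."""
    seen = set()
    out = []
    for rec in records:
        k = key_fn(rec)
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out

records = [{"id": 1}, {"id": 2}, {"id": 1}]
print(dedup_by_key(records, lambda r: r["id"]))
```

On the Kafka side itself, enabling the idempotent producer (`enable.idempotence=true`) also prevents duplicates caused by producer retries.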

Give it a try and let me know in the comments!

Regards

Alfonso Gallardo
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark.
