duplicate data published in kafka offset

dipali_globant
New Contributor II

We have 25k records, which we publish in batches of 5k.

We number the records with the row_number window function and build the batches from those numbers.

We have observed that some records (around 10-20) are published twice, at two different offsets.

Can someone help with the probable cause of this issue?

1 REPLY

agallard
Contributor

Hi @dipali_globant,

Duplicate data in Kafka can arise in a batch processing scenario for a few reasons 🤔

Here's an example of ensuring unique and consistent row numbering:

 

from pyspark.sql import Window
from pyspark.sql.functions import row_number

window_spec = Window.orderBy("unique_column")  # Replace "unique_column" with a reliable ordering column

df = df.withColumn("row_number", row_number().over(window_spec))

 

The ROW_NUMBER window function can assign different row numbers across batch executions if the ordering is not consistent. When the ordering column has ties, or when partitioning changes between runs, the same record can land in more than one batch. Order by a column or combination of columns that is unique and stable, such as a primary key, or a timestamp combined with a key as a tiebreaker.
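To make the batching step concrete, here is a minimal sketch of assigning a batch id from a stable row number. It assumes the 5k batch size from the question; the names `BATCH_SIZE`, `batch_id_for`, and `with_stable_batch_id`, and the columns passed as `order_cols`, are hypothetical and not taken from the original job.

```python
BATCH_SIZE = 5000  # assumption: 25k records published in batches of 5k


def batch_id_for(row_number, batch_size=BATCH_SIZE):
    """Map a 1-based row number to a 0-based batch id."""
    return (row_number - 1) // batch_size


def with_stable_batch_id(df, order_cols, batch_size=BATCH_SIZE):
    """Add row_number and batch_id columns to a Spark DataFrame.

    order_cols should together identify a row uniquely; otherwise ties
    can be numbered differently each time the plan is evaluated.
    """
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    window_spec = Window.orderBy(*order_cols)
    numbered = df.withColumn("row_number", F.row_number().over(window_spec))
    # Same arithmetic as batch_id_for, expressed on columns.
    return numbered.withColumn(
        "batch_id",
        F.floor((F.col("row_number") - 1) / batch_size).cast("int"),
    )
```

For example, `with_stable_batch_id(df, ["event_ts", "id"])` would order by a timestamp and break ties with a key (both column names are placeholders). Because Spark evaluates lazily, each per-batch filter may re-run the window, so it is probably worth persisting the numbered DataFrame or writing it to a table before looping over the batches.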

If you process and publish multiple batches at the same time, concurrency issues can also arise, especially for records that sit on the boundary between two consecutive batches.
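One way to rule out concurrency is to publish the batches one after another over row-number ranges that cannot overlap. The sketch below is only an illustration under assumptions: `batch_bounds` and `publish_sequentially` are hypothetical names, `key_col` stands for whatever unique key the data has, and the DataFrame is assumed to already carry a stable `row_number` column. Keep in mind that Spark's Kafka sink gives at-least-once delivery, so a retried write can still produce a duplicate; sending a unique key with each message at least lets consumers detect it.

```python
def batch_bounds(total_rows, batch_size):
    """Return inclusive (first_row, last_row) ranges that do not overlap."""
    return [
        (start, min(start + batch_size - 1, total_rows))
        for start in range(1, total_rows + 1, batch_size)
    ]


def publish_sequentially(numbered_df, total_rows, batch_size,
                         bootstrap_servers, topic, key_col):
    """Write each batch to Kafka in turn; save() returns once the write is done."""
    # Imported here so batch_bounds can be used without Spark.
    from pyspark.sql import functions as F

    for first_row, last_row in batch_bounds(total_rows, batch_size):
        # between() is inclusive on both ends, matching batch_bounds.
        batch = numbered_df.filter(
            F.col("row_number").between(first_row, last_row)
        )
        (
            batch.select(
                F.col(key_col).cast("string").alias("key"),
                F.to_json(F.struct(*batch.columns)).alias("value"),
            )
            .write.format("kafka")
            .option("kafka.bootstrap.servers", bootstrap_servers)
            .option("topic", topic)
            .save()
        )
```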

Check this and let me know in the comments!

Regards

Alfonso Gallardo
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark
