Hi @dipali_globant,
Duplicate data in Kafka can arise in a batch processing scenario for a few reasons.
Here is an example of ensuring unique and consistent row numbering:
from pyspark.sql import Window
from pyspark.sql.functions import row_number

window_spec = Window.orderBy("unique_column")  # Replace "unique_column" with a reliable ordering column
df = df.withColumn("row_number", row_number().over(window_spec))  # Assigns a deterministic row number per row
The ROW_NUMBER window function may assign different row numbers across batch executions if there is no consistent ordering. This can result in the same records being included in multiple batches, especially if partitioning or ordering changes between runs. To avoid this, order by a column (or combination of columns) that is unique and stable, such as a timestamp or primary key.
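For instance, if your data has an event timestamp and a primary key, a composite ordering keeps the numbering deterministic even when two rows share the same timestamp. A minimal sketch (the column names "event_ts" and "id" are just examples, replace them with your own):

from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Order by a stable, unique combination: timestamp first, primary key as tie-breaker
stable_window = Window.orderBy("event_ts", "id")
df = df.withColumn("row_number", row_number().over(stable_window))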
If you are processing and publishing multiple batches concurrently, overlap issues can also arise, especially when records fall on the boundary between two consecutive batch windows.
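As a safety net, you can also deduplicate on a business key before publishing to Kafka, keeping only the latest version of each record. A rough sketch, assuming hypothetical "id" and "last_updated" columns:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Keep only the most recent row per key before writing to Kafka
dedup_window = Window.partitionBy("id").orderBy(col("last_updated").desc())
deduped_df = (
    df.withColumn("rn", row_number().over(dedup_window))
      .filter(col("rn") == 1)
      .drop("rn")
)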
Check it out and let me know if you have any comments!
Regards
Alfonso Gallardo
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark.