Hi @dipali_globant,
Duplicate data in Kafka can arise in a batch processing scenario for a few reasons 🤔
Here's an example of ensuring unique and consistent row numbering:
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# Replace "unique_column" with a reliable ordering column that is unique and stable across runs
window_spec = Window.orderBy("unique_column")
df = df.withColumn("row_number", row_number().over(window_spec))
The ROW_NUMBER window function can assign different row numbers across batch executions if the ordering is not deterministic. As a result, the same records may end up in more than one batch, especially if partitioning or row order changes between runs. To avoid this, order by a column or combination of columns that is unique and stable, such as a primary key, or a timestamp combined with a key that breaks ties.
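To see why the tie-breaker matters, here is a minimal plain-Python sketch of the same idea (it only mimics ROW_NUMBER, it is not Spark itself). The records, the `event_time` and `id` fields, and the `number_rows` helper are all hypothetical, and it assumes `id` is unique:

```python
def number_rows(records, order_keys):
    """Assign 1-based row numbers after sorting by the given keys (mimics ROW_NUMBER)."""
    ordered = sorted(records, key=lambda r: tuple(r[k] for k in order_keys))
    return {r["id"]: n for n, r in enumerate(ordered, start=1)}

# Two hypothetical runs that read the same rows in a different arrival order
run_a = [
    {"id": "a", "event_time": 1},
    {"id": "b", "event_time": 1},
    {"id": "c", "event_time": 2},
]
run_b = [run_a[1], run_a[0], run_a[2]]

# Ordering by a non-unique column: tied rows keep arrival order, so numbering differs
unstable_a = number_rows(run_a, ["event_time"])
unstable_b = number_rows(run_b, ["event_time"])

# Ordering by the column plus a unique tie-breaker: numbering is the same in both runs
stable_a = number_rows(run_a, ["event_time", "id"])
stable_b = number_rows(run_b, ["event_time", "id"])

print(unstable_a == unstable_b)  # False
print(stable_a == stable_b)      # True
```

In PySpark, the equivalent change would be to pass both columns to the window, for example `Window.orderBy("event_time", "id")`, using whatever column names fit your table.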
If you are processing and publishing multiple batches simultaneously, concurrency issues could arise, especially if some data falls between two consecutive batch windows.
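One possible way to keep a record from landing in two batches (or in none) is to cut the batches on half-open row-number ranges. This is only a sketch under the assumption that your batches are defined by row number; the `batch_ranges` helper and the sizes used are hypothetical:

```python
def batch_ranges(total_rows, batch_size):
    """Return half-open (start, end) row-number ranges that cover 1..total_rows exactly once."""
    return [
        (start, min(start + batch_size, total_rows + 1))
        for start in range(1, total_rows + 1, batch_size)
    ]

ranges = batch_ranges(total_rows=10, batch_size=4)
print(ranges)  # [(1, 5), (5, 9), (9, 11)]

# Count how many batches claim each row number, using start <= n < end
owners_per_row = {
    n: sum(1 for start, end in ranges if start <= n < end)
    for n in range(1, 11)
}
print(all(count == 1 for count in owners_per_row.values()))  # True
```

Each batch would then filter on `row_number >= start` and `row_number < end`, so the boundary row belongs to only one of two consecutive batches.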
Give it a check and share your comments!
Regards,
Alfonso Gallardo
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark