mongodb connector duplicate writes

__Aziz__ — Fri, 05 Dec 2025 12:28:12 GMT

Hi everyone,

Has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is:

overwrite the collection (very fast),
then create the indexes.

Occasionally, I’m seeing duplicates appear in MongoDB even though the Delta Lake source contains no duplicates. It looks like some Spark tasks sometimes fail and get retried, which leads to the same data being written twice, since there’s no uniqueness constraint at that moment.

Has anyone dealt with this behavior or found a reliable way to prevent duplicates during writes?

Re: mongodb connector duplicate writes

bianca_unifeye — Fri, 05 Dec 2025 12:54:22 GMT

Hi Aziz,

What you’re seeing is expected behaviour when Spark task retries are combined with non-idempotent writes.

Spark’s write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task from the start of its partition, and the documents the failed attempt already wrote are not rolled back.
From Spark’s perspective this is correct behaviour, but MongoDB has no idea it’s a “retry”; it just sees another insert.

If, at the time of the write:

There is no unique index on the target key
And you’re doing plain inserts (no upsert / idempotent key)

then the same row can be written twice when a task is retried.

Because your Delta source is clean and deduplicated, the issue isn’t in Delta. It comes from the at-least-once semantics of the write path.
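
To make the retry behaviour concrete, here is a small pure-Python simulation. It does not use Spark or MongoDB; the function names and the order_id field are made up for illustration, and the "collection" is just a list that accepts plain inserts.

```python
# In-memory simulation only: a list stands in for a MongoDB collection
# that has no unique index and receives plain inserts.

def write_partition(collection, rows, fail_after=None):
    """Insert rows one by one; optionally fail part-way through."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated task failure")
        collection.append(dict(row))


def run_task_with_retry(collection, rows, fail_after):
    """First attempt fails part-way; the retry rewrites the whole partition."""
    try:
        write_partition(collection, rows, fail_after=fail_after)
    except RuntimeError:
        # Nothing is rolled back, and the retry starts again from row 0.
        write_partition(collection, rows)


rows = [{"order_id": n} for n in range(5)]
collection = []
run_task_with_retry(collection, rows, fail_after=3)

duplicates = len(collection) - len({r["order_id"] for r in collection})
print(len(collection), duplicates)  # 8 3
```

The source has 5 distinct rows, but the target ends up with 8 documents: the 3 written before the failure plus the 5 written by the retry.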

To fix it:

Introduce a unique, deterministic key and write via upserts keyed on it (for example by using it as _id) so retries are safe, or
Write to a staging collection and swap it in with an atomic rename so partial/duplicate states never hit your live collection.

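Either approach keeps duplicates out of the live collection even when Spark tasks are retried. With the staging route, build the unique index (or otherwise validate) on the staging collection before the rename, since a retry can still duplicate rows inside staging.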
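
A minimal sketch of the first option, again as a pure-Python simulation rather than real connector code. The deterministic_id helper, the order_id field and the dict standing in for a collection keyed on _id are all assumptions for illustration. In the MongoDB Spark Connector 10.x, I believe the relevant write options are operationType and idFieldList, but please check the documentation for the version you run.

```python
import hashlib


def deterministic_id(row, key_fields):
    """Derive a stable _id from the business key, so a retry yields the same _id."""
    raw = "|".join(str(row[f]) for f in key_fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def upsert_partition(collection, rows, key_fields, fail_after=None):
    """Replace-or-insert each row keyed on _id; optionally fail part-way."""
    for i, row in enumerate(rows):
        if fail_after is not None and i == fail_after:
            raise RuntimeError("simulated task failure")
        doc = dict(row)
        doc["_id"] = deterministic_id(row, key_fields)
        collection[doc["_id"]] = doc  # same _id overwrites instead of duplicating


rows = [{"order_id": n, "amount": n * 10} for n in range(5)]
live = {}  # dict keyed on _id stands in for a collection with upsert semantics

try:
    upsert_partition(live, rows, ["order_id"], fail_after=3)
except RuntimeError:
    upsert_partition(live, rows, ["order_id"])  # retry of the whole partition

print(len(live))  # 5
```

Because the _id depends only on the row's key, the retry overwrites the 3 documents from the failed attempt instead of adding them again. Avoid random or auto-generated ids here: they differ on each attempt, which brings the duplicates back.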
