Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

MongoDB connector duplicate writes

__Aziz__
Visitor

Hi everyone,


Has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is:

  1. overwrite the collection (very fast),

  2. then create the indexes.

Occasionally, I’m seeing duplicates appear in MongoDB even though the Delta Lake source contains no duplicates. It looks like some Spark tasks fail and get retried, which leads to the same data being written twice, since there’s no uniqueness constraint in place at that point.

Has anyone dealt with this behavior or found a reliable way to prevent duplicates during writes?
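
For reference, the write step looks roughly like this (simplified, with placeholder names and connection details, assuming the 10.x connector option names):

    # Roughly the current overwrite step; delta_df is the deduplicated Delta DataFrame.
    (delta_df.write
        .format("mongodb")
        .mode("overwrite")                              # drops and rewrites the collection
        .option("connection.uri", "<connection-uri>")   # placeholder
        .option("database", "<database>")               # placeholder
        .option("collection", "<collection>")           # placeholder
        .save())

After this finishes, I recreate the indexes on the collection.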

1 ACCEPTED SOLUTION


bianca_unifeye
New Contributor III

Hi Aziz,

What you’re seeing is expected behaviour when you combine Spark retries with non-idempotent writes.

Spark’s write path is task-based and fault-tolerant. If a task fails part-way through writing to MongoDB, Spark will retry that task.
From Spark’s perspective this is correct behaviour, but MongoDB has no idea it’s a “retry”; it just sees another insert.

If, at the time of the write:

  • There is no unique index on the target key

  • And you’re doing plain inserts (no upsert / idempotent key)

then the same row can be written twice when a task is retried.

Because your Delta source is clean and deduplicated, the issue isn’t in Delta; it’s the at-least-once semantics of the write path.

To fix it:

  • Introduce a unique, deterministic key and write via upserts / _id so retries are safe (first sketch below), or

  • Use a staging collection + an atomic rename so partial/duplicate states never hit your live collection (second sketch below).

Either approach will eliminate duplicates even when Spark tasks are retried.
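
For the first option, here is a minimal sketch with the 10.x connector. I’m assuming a column called customer_id uniquely identifies a row; substitute your own key, URI, and names:

    # Sketch: idempotent writes keyed on a deterministic _id.
    # operationType "replace" plus idFieldList turns each write into a
    # replace-by-_id, so a retried task just rewrites the same documents.
    (delta_df.write
        .format("mongodb")
        .mode("append")                                 # no collection drop
        .option("connection.uri", "<connection-uri>")   # placeholder
        .option("database", "<database>")               # placeholder
        .option("collection", "<collection>")           # placeholder
        .option("operationType", "replace")             # replace instead of plain insert
        .option("idFieldList", "customer_id")           # _id is derived from this field
        .save())

Because _id is unique by definition, a retried task can never produce a second copy of the same row.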

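For the second option, a sketch of the staging-and-swap pattern using pymongo on the driver. Names are placeholders; I’ve kept the deterministic _id from the first sketch so task retries inside staging stay idempotent, and it’s worth checking the renameCollection restrictions for your MongoDB version and topology (e.g. sharded collections):

    # Sketch: load into a staging collection, then atomically swap it into place.
    from pymongo import MongoClient

    uri = "<connection-uri>"           # placeholder
    db_name = "<database>"             # placeholder
    staging = "<collection>_staging"   # placeholder
    live = "<collection>"              # placeholder

    # 1) Overwrite the staging collection and rebuild indexes there first.
    (delta_df.write
        .format("mongodb")
        .mode("overwrite")
        .option("connection.uri", uri)
        .option("database", db_name)
        .option("collection", staging)
        .option("operationType", "replace")
        .option("idFieldList", "customer_id")   # keeps task retries idempotent in staging
        .save())

    client = MongoClient(uri)
    client[db_name][staging].create_index("customer_id")   # rebuild indexes before the swap

    # 2) Rename staging over the live collection in one server-side operation.
    #    dropTarget=True replaces the old live collection, so readers never see
    #    a partial or half-written state.
    client.admin.command(
        "renameCollection", f"{db_name}.{staging}",
        to=f"{db_name}.{live}",
        dropTarget=True,
    )

Spark task retries can only ever touch the staging collection, so the live collection always flips from one complete snapshot to the next.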

