<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: mongodb connector duplicate writes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/mongodb-connector-duplicate-writes/m-p/141248#M51673</link>
    <description>&lt;P&gt;Hi Aziz,&lt;/P&gt;&lt;P&gt;What you’re seeing is expected behaviour when Spark task retries are combined with non-idempotent writes.&lt;/P&gt;&lt;P&gt;Spark’s write path is &lt;STRONG&gt;task-based and fault-tolerant&lt;/STRONG&gt;. If a task fails part-way through writing to MongoDB, Spark will retry that task, and the documents the failed attempt had already inserted stay in the collection.&lt;BR /&gt;From Spark’s perspective this is correct behaviour, but MongoDB has no idea it’s a “retry”; it just sees another insert.&lt;/P&gt;&lt;P&gt;If, at the time of the write:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;there is &lt;STRONG&gt;no unique index&lt;/STRONG&gt; on the target key&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;and you’re doing &lt;STRONG&gt;plain inserts&lt;/STRONG&gt; (no upsert / idempotent key)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;then the same row can be written twice when a task is retried.&lt;/P&gt;&lt;P&gt;Because your Delta source is clean and deduplicated, the issue isn’t in Delta; it’s the &lt;STRONG&gt;at-least-once semantics&lt;/STRONG&gt; of the write path.&lt;/P&gt;&lt;P&gt;To fix it:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Introduce a &lt;STRONG&gt;unique, deterministic key&lt;/STRONG&gt; and write via upserts on that key (for example, by using it as _id) so retries are safe, &lt;STRONG&gt;or&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Use a &lt;STRONG&gt;staging collection + atomic rename&lt;/STRONG&gt;: load into staging, build the unique index there (the build fails if a retry introduced duplicates), and rename over the live collection only once it succeeds, so partial/duplicate states never hit your live collection.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Either approach keeps duplicates out of your live collection even when Spark tasks are retried.&lt;/P&gt;</description>
    <pubDate>Fri, 05 Dec 2025 12:54:22 GMT</pubDate>
    <dc:creator>bianca_unifeye</dc:creator>
    <dc:date>2025-12-05T12:54:22Z</dc:date>
    <item>
      <title>mongodb connector duplicate writes</title>
      <link>https://community.databricks.com/t5/data-engineering/mongodb-connector-duplicate-writes/m-p/141243#M51669</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;Has anyone run into this issue? I’m using the MongoDB Spark Connector on Databricks to expose data from Delta Lake to MongoDB. My workflow is:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;overwrite the collection (very fast),&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;then create the indexes.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Occasionally, I’m seeing duplicates appear in MongoDB even though the Delta Lake source contains no duplicates. It looks like some Spark tasks sometimes fail and get retried, which leads to the same data being written twice, since there’s no uniqueness constraint at that moment.&lt;/P&gt;&lt;P&gt;Has anyone dealt with this behavior or found a reliable way to prevent duplicates during writes?&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2025 12:28:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/mongodb-connector-duplicate-writes/m-p/141243#M51669</guid>
      <dc:creator>__Aziz__</dc:creator>
      <dc:date>2025-12-05T12:28:12Z</dc:date>
    </item>
    <item>
      <title>Re: mongodb connector duplicate writes</title>
      <link>https://community.databricks.com/t5/data-engineering/mongodb-connector-duplicate-writes/m-p/141248#M51673</link>
      <description>&lt;P&gt;Hi Aziz,&lt;/P&gt;&lt;P&gt;What you’re seeing is expected behaviour when Spark task retries are combined with non-idempotent writes.&lt;/P&gt;&lt;P&gt;Spark’s write path is &lt;STRONG&gt;task-based and fault-tolerant&lt;/STRONG&gt;. If a task fails part-way through writing to MongoDB, Spark will retry that task, and the documents the failed attempt had already inserted stay in the collection.&lt;BR /&gt;From Spark’s perspective this is correct behaviour, but MongoDB has no idea it’s a “retry”; it just sees another insert.&lt;/P&gt;&lt;P&gt;If, at the time of the write:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;there is &lt;STRONG&gt;no unique index&lt;/STRONG&gt; on the target key&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;and you’re doing &lt;STRONG&gt;plain inserts&lt;/STRONG&gt; (no upsert / idempotent key)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;then the same row can be written twice when a task is retried.&lt;/P&gt;&lt;P&gt;Because your Delta source is clean and deduplicated, the issue isn’t in Delta; it’s the &lt;STRONG&gt;at-least-once semantics&lt;/STRONG&gt; of the write path.&lt;/P&gt;&lt;P&gt;To fix it:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Introduce a &lt;STRONG&gt;unique, deterministic key&lt;/STRONG&gt; and write via upserts on that key (for example, by using it as _id) so retries are safe, &lt;STRONG&gt;or&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Use a &lt;STRONG&gt;staging collection + atomic rename&lt;/STRONG&gt;: load into staging, build the unique index there (the build fails if a retry introduced duplicates), and rename over the live collection only once it succeeds, so partial/duplicate states never hit your live collection.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Either approach keeps duplicates out of your live collection even when Spark tasks are retried.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Dec 2025 12:54:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/mongodb-connector-duplicate-writes/m-p/141248#M51673</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-12-05T12:54:22Z</dc:date>
    </item>
  </channel>
</rss>

