<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Recommended way to integrate MongoDB as a streaming source in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27308#M19185</link>
    <description>&lt;P&gt;Current state:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data is stored in MongoDB Atlas, which is used extensively by all services&lt;/LI&gt;&lt;LI&gt;The data lake is hosted in the same AWS region and connected to MongoDB over a private link&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Requirements:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Streaming pipelines that continuously ingest, transform/analyze, and serve data with the lowest possible latency&lt;/LI&gt;&lt;LI&gt;Downstream processed data is aggregated and stored in the data lake, and must also be available as a stream to external subscribers (potentially via AWS MSK)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Question: what is the recommended (and reliable) way to ingest data from MongoDB Atlas as a stream?&lt;/P&gt;&lt;P&gt;Option 1: Use MongoDB change streams, with Kafka Connect and a Kafka topic as a proxy between MongoDB and Databricks, so that Databricks is only aware of Kafka topics.&lt;/P&gt;&lt;P&gt;Option 2: Connect to MongoDB directly using the mongo-spark connector and watch the collection explicitly. This might require some binding via an in-memory queue or something similar that can be observed in Scala, as well as managing checkpoints, etc.&lt;/P&gt;&lt;P&gt;Any other ideas? Any feedback from someone who has implemented this in production would be appreciated.&lt;/P&gt;</description>
    <pubDate>Tue, 22 Feb 2022 19:40:24 GMT</pubDate>
    <dc:creator>amichel</dc:creator>
    <dc:date>2022-02-22T19:40:24Z</dc:date>
    <item>
      <title>Recommended way to integrate MongoDB as a streaming source</title>
      <link>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27308#M19185</link>
      <description>&lt;P&gt;Current state:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data is stored in MongoDB Atlas, which is used extensively by all services&lt;/LI&gt;&lt;LI&gt;The data lake is hosted in the same AWS region and connected to MongoDB over a private link&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Requirements:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Streaming pipelines that continuously ingest, transform/analyze, and serve data with the lowest possible latency&lt;/LI&gt;&lt;LI&gt;Downstream processed data is aggregated and stored in the data lake, and must also be available as a stream to external subscribers (potentially via AWS MSK)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Question: what is the recommended (and reliable) way to ingest data from MongoDB Atlas as a stream?&lt;/P&gt;&lt;P&gt;Option 1: Use MongoDB change streams, with Kafka Connect and a Kafka topic as a proxy between MongoDB and Databricks, so that Databricks is only aware of Kafka topics.&lt;/P&gt;&lt;P&gt;Option 2: Connect to MongoDB directly using the mongo-spark connector and watch the collection explicitly. This might require some binding via an in-memory queue or something similar that can be observed in Scala, as well as managing checkpoints, etc.&lt;/P&gt;&lt;P&gt;Any other ideas? Any feedback from someone who has implemented this in production would be appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Feb 2022 19:40:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27308#M19185</guid>
      <dc:creator>amichel</dc:creator>
      <dc:date>2022-02-22T19:40:24Z</dc:date>
    </item>
    <item>
      <title>Re: Recommended way to integrate MongoDB as a streaming source</title>
      <link>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27309#M19186</link>
      <description>&lt;P&gt;We went with pretty much the approach you outlined above as option 1. One thing I would recommend is to set up a schema registry and use Avro, so that the messages going to the Kafka topic are Avro messages that must comply with a schema in the registry. Your Databricks streaming service can then check each message's schema against that registry before ingesting it.&lt;/P&gt;</description>
      <pubDate>Tue, 31 May 2022 13:13:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27309#M19186</guid>
      <dc:creator>dbarrundiag</dc:creator>
      <dc:date>2022-05-31T13:13:12Z</dc:date>
    </item>
    <item>
      <title>Re: Recommended way to integrate MongoDB as a streaming source</title>
      <link>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27310#M19187</link>
      <description>&lt;P&gt;Another option, if you'd like to use Spark for the ingestion, is the new Spark Connector V10.0, which supports Spark Structured Streaming: &lt;A href="https://www.mongodb.com/developer/languages/python/streaming-data-apache-spark-mongodb/" alt="https://www.mongodb.com/developer/languages/python/streaming-data-apache-spark-mongodb/" target="_blank"&gt;https://www.mongodb.com/developer/languages/python/streaming-data-apache-spark-mongodb/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you use Kafka, the MongoDB Connector, when used as a source, creates a change stream under the covers. It has a flag, "copy.existing", which copies the existing data first and then starts the stream of changes.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jun 2022 17:44:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27310#M19187</guid>
      <dc:creator>robwma</dc:creator>
      <dc:date>2022-06-21T17:44:54Z</dc:date>
    </item>
    <item>
      <title>Re: Recommended way to integrate MongoDB as a streaming source</title>
      <link>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27312#M19189</link>
      <description>Hi,&lt;BR /&gt;Eventually we settled on a solution that uses the MongoDB Atlas $out feature to export to S3 and ingests the files as a stream with Databricks Auto Loader.&lt;BR /&gt;&lt;A href="https://www.mongodb.com/developer/products/atlas/automated-continuous-data-copying-from-mongodb-to-s3/" target="test_blank"&gt;https://www.mongodb.com/developer/products/atlas/automated-continuous-data-copying-from-mongodb-to-s3/&lt;/A&gt;&lt;BR /&gt;We will also need to try the new connector, as suggested here.</description>
      <pubDate>Thu, 07 Jul 2022 12:03:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/recommended-way-to-integrate-mongodb-as-a-streaming-source/m-p/27312#M19189</guid>
      <dc:creator>amichel</dc:creator>
      <dc:date>2022-07-07T12:03:15Z</dc:date>
    </item>
  </channel>
</rss>

