- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-22-2022 11:40 AM
Current state:
- Data is stored in MongoDB Atlas which is used extensively by all services
- Data lake is hosted in same AWS region and connected to MongoDB over private link
Requirements:
- Streaming pipelines that continuously ingest, transform/analyze and serve data with lowest possible latency
- Downstream processed data is aggregated and stored in the data lake, while it is also required to be available as a stream to external subscribers (via AWS MSK potentially)
Question: what is the recommended (and reliable) way to ingest MongoDB Atlas as a stream?
option 1: Use mongo change streams and have Kafka Connect and Kafka topic to proxy between Mongo and Databricks, such that Databricks is only aware of Kafka topics
option 2: Connect to mongo directly using mongo-spark connector and watching the collection explicitly. This might require some binding via in-memory queue or something similar that can be observed in scala, as well as managing checkpoints, etc.
any other ideas? any feedback from someone who implemented this in production appreciated.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-31-2022 06:13 AM
We went with pretty much approach No. 1 you outlined above. One thing that I would recommend is that you setup a schema registry and leverage Avro, such that the messages that are going to the Kafka topic are Avro messages that must comply to a schema registry that your Databricks streaming service will be able to ingest by checking the schema first against such registry.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-21-2022 10:44 AM
Another option if you'd like to use Spark as the ingestion is to use the new Spark Connector V10.0 which support Spark Structured Streaming. https://www.mongodb.com/developer/languages/python/streaming-data-apache-spark-mongodb/.
If you use Kafka, the MongoDB Connector when used as a source creates a change stream under the covers and has a flag, "copy.existing" which will copy the existing data first then start the stream of data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-07-2022 05:03 AM
Eventually we agreed on a solution to use MongoDB Atlas $out feature to export to S3 and ingest files using Databricks Autoloader as a stream.
https://www.mongodb.com/developer/products/atlas/automated-continuous-data-copying-from-mongodb-to-s...
Will need to try the new connector too as suggested here.