Recommended way to integrate MongoDB as a streaming source

amichel
New Contributor III

Current state:

  • Data is stored in MongoDB Atlas which is used extensively by all services
  • Data lake is hosted in same AWS region and connected to MongoDB over private link
  •  

Requirements:

  • Streaming pipelines that continuously ingest, transform/analyze and serve data with lowest possible latency
  • Downstream processed data is aggregated and stored in the data lake, while it is also required to be available as a stream to external subscribers (via AWS MSK potentially)

Question: what is the recommended (and reliable) way to ingest MongoDB Atlas as a stream?

option 1: Use mongo change streams and have Kafka Connect and Kafka topic to proxy between Mongo and Databricks, such that Databricks is only aware of Kafka topics

option 2: Connect to mongo directly using mongo-spark connector and watching the collection explicitly. This might require some binding via in-memory queue or something similar that can be observed in scala, as well as managing checkpoints, etc.

any other ideas? any feedback from someone who implemented this in production appreciated.