Current state:
- Data is stored in MongoDB Atlas, which is used extensively by all services
- The data lake is hosted in the same AWS region and connected to MongoDB over a private link
Requirements:
- Streaming pipelines that continuously ingest, transform/analyze, and serve data with the lowest possible latency
- Downstream processed data is aggregated and stored in the data lake, and must also be available as a stream to external subscribers (potentially via AWS MSK); a rough sketch of the serve step I have in mind is below
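To make the serve step concrete, this is roughly what I picture on the Databricks side (Scala, Structured Streaming). All paths, broker addresses and topic names below are placeholders, not real config:

```scala
// Rough sketch of the "serve" step: one processed stream fanned out to the
// data lake (Delta) and to a Kafka topic on MSK inside foreachBatch.
// All names, paths and brokers are placeholders.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def serve(processedStream: DataFrame): Unit = {
  processedStream.writeStream
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      // 1) append the aggregated batch to the data lake
      batch.write
        .format("delta")
        .mode("append")
        .save("s3://my-data-lake/aggregates/") // placeholder path

      // 2) publish the same batch to MSK for external subscribers
      batch
        .select(to_json(struct(col("*"))).alias("value"))
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "msk-broker:9092") // placeholder
        .option("topic", "processed-events")                  // placeholder
        .save()
    }
    .option("checkpointLocation", "s3://my-data-lake/_checkpoints/serve/")
    .start()
}
```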
Question: what is the recommended (and reliable) way to ingest data from MongoDB Atlas as a stream?
option 1: use MongoDB change streams with Kafka Connect and a Kafka topic as a proxy between Mongo and Databricks, so that Databricks is only aware of Kafka topics (the Databricks side of this is sketched after the options)
option 2: connect to Mongo directly using the mongo-spark connector and watch the collection explicitly. This might require some binding via an in-memory queue or something similar that can be observed in Scala, as well as managing checkpoints, etc. (a rough sketch of the connector-based read is at the end of this post)
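For option 1, the Databricks job would only ever talk to Kafka, something along these lines. Broker address, topic name and the event schema are placeholders and depend on how the Kafka Connect source connector is configured; this assumes a Databricks notebook where `spark` is already in scope:

```scala
// Option 1 on the Databricks side: the Kafka Connect source connector pushes
// MongoDB change-stream events to a topic, and the Spark job only sees Kafka.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Placeholder schema; the real payload shape depends on the connector config.
val changeEventSchema = new StructType()
  .add("operationType", StringType)
  .add("fullDocument", StringType)

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "msk-broker:9092") // placeholder broker
  .option("subscribe", "mongo.mydb.mycollection")       // placeholder topic
  .option("startingOffsets", "latest")
  .load()

// Parse the JSON value into columns for downstream transforms.
val events = raw
  .select(from_json(col("value").cast("string"), changeEventSchema).alias("event"))
  .select("event.*")
```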
any other ideas? any feedback from someone who has implemented this in production would be appreciated.
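For reference, my current understanding of option 2 with the mongo-spark connector v10.x (which exposes change streams through Structured Streaming) is roughly the following; the connection string, names, schema and exact option keys are placeholders that would need to be checked against the connector version in use, and I'm not sure it fully removes the checkpoint concerns:

```scala
// Option 2 sketch, assuming the mongo-spark connector v10.x streaming read
// mode (change streams surfaced through spark.readStream).
import org.apache.spark.sql.types._

// Placeholder schema for the documents in the watched collection.
val schema = new StructType()
  .add("_id", StringType)
  .add("payload", StringType)

val mongoStream = spark.readStream
  .format("mongodb")
  .option("spark.mongodb.connection.uri", "<atlas-connection-string>")
  .option("spark.mongodb.database", "mydb")           // placeholder
  .option("spark.mongodb.collection", "mycollection") // placeholder
  .option("spark.mongodb.change.stream.publish.full.document.only", "true")
  .schema(schema)
  .load()
```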