Streaming Amazon DocumentDB to Databricks in near real time - what's the best approach?

AustinBen
Visitor

Hi everyone,

I'm looking for advice from anyone who has implemented near real-time ingestion from Amazon DocumentDB into Databricks.

Our current architecture is:

  • Application → Amazon DocumentDB

  • Python AWS Lambda functions capture changes from DocumentDB

  • Lambda continuously writes the data into Amazon Redshift

  • Redshift is then used as our data warehouse
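
To make the capture step concrete, here is a minimal sketch of the kind of flattening a Lambda might do before writing rows to a sink. It is illustrative only, not our production code. It assumes the function is invoked through a DocumentDB change stream event source mapping and that each invocation carries a batch under an `events` key, with fields such as `operationType`, `ns`, `documentKey`, and `fullDocument`. The function name `flatten_change_events` and the output column names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def flatten_change_events(event: Dict[str, Any]) -> List[Dict[str, Optional[str]]]:
    """Turn one (assumed) DocumentDB change stream batch into flat rows."""
    rows: List[Dict[str, Optional[str]]] = []
    # Assumption: the batch of change records sits under the "events" key.
    for record in event.get("events", []):
        change = record.get("event", {})
        ns = change.get("ns", {})
        rows.append(
            {
                "database": ns.get("db"),
                "collection": ns.get("coll"),
                "operation": change.get("operationType"),
                # documentKey identifies the changed document.
                "document_key": json.dumps(change.get("documentKey"), sort_keys=True),
                # fullDocument can be missing (for example on deletes),
                # in which case the row carries None.
                "document": (
                    json.dumps(change["fullDocument"], sort_keys=True)
                    if "fullDocument" in change
                    else None
                ),
            }
        )
    return rows
```

Keeping this step as a pure function means the same rows could be sent to Redshift today or to a different sink later without changing the parsing logic.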

This setup has been working well for us.

We're now evaluating Databricks as our analytics platform, but I'm not finding a straightforward way to stream data directly from DocumentDB into Databricks. I've heard that Databricks doesn't have a native connector or CDC support for Amazon DocumentDB.

My questions are:

  1. Has anyone successfully implemented near real-time or real-time ingestion from Amazon DocumentDB into Databricks?

  2. What architecture are you using?

I'm interested in production-proven architectures rather than proof-of-concept examples.

Thanks in advance!