Databricks Community

AustinBen · 3 weeks ago

Hi everyone,

I'm looking for advice from anyone who has implemented near real-time ingestion from Amazon DocumentDB into Databricks.

Our current architecture is:

Application → Amazon DocumentDB
Python AWS Lambda functions capture changes from DocumentDB
Lambda continuously writes the data into Amazon Redshift
Redshift is then used as our data warehouse

This setup has been working well for us.

We're now evaluating Databricks as our analytics platform, but I'm not finding a straightforward way to stream data directly from DocumentDB into Databricks. I've heard that Databricks doesn't have a native connector or CDC support for Amazon DocumentDB.

My questions are:

Has anyone successfully implemented near real-time or real-time ingestion from Amazon DocumentDB into Databricks?
What architecture are you using?

I'm interested in production-proven architectures rather than proof-of-concept examples.

Thanks in advance!

anagilla · 2 weeks ago

The best pattern I can think of is to put a streaming bus between DocumentDB and Databricks and consume it with Structured Streaming. You are most of the way there already.

Lowest-disruption path, since you already capture changes in Lambda:

Repoint your Lambda to publish DocumentDB change events to Amazon Kinesis Data Streams (or MSK) instead of, or alongside, Redshift.
Read that stream in Databricks Structured Streaming (native Kinesis and Kafka/MSK sources) into an append-only Bronze Delta table. Keep the document payload as VARIANT or string so an upstream schema change does not break ingestion.
Fold inserts, updates, and deletes into a current-state Silver table with a MERGE in foreachBatch, or AUTO CDC (APPLY CHANGES INTO) in a Lakeflow declarative pipeline, keyed by _id.
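
To make the first and third steps concrete, here is a minimal plain-Python sketch of the logic, not the Spark code itself. It assumes MongoDB-style change events (`operationType`, `documentKey`, `fullDocument`, `clusterTime`), and it simplifies `clusterTime` to a sortable integer. The envelope fields and the function names `to_envelope`, `to_kinesis_record`, and `fold_batch` are hypothetical, chosen for illustration. In Databricks the same fold would be a MERGE inside foreachBatch or an AUTO CDC flow.

```python
import json
from typing import Any, Dict, Iterable, Optional


def to_envelope(change: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a change event into a small envelope (hypothetical shape)."""
    full = change.get("fullDocument")  # absent on deletes, and on updates without lookup
    doc: Optional[str] = json.dumps(full, default=str) if full is not None else None
    return {
        "id": str(change["documentKey"]["_id"]),
        "op": change["operationType"],  # insert | update | replace | delete
        "ts": change["clusterTime"],    # assumption: already a sortable integer
        "doc": doc,                     # payload kept as a string, as suggested above
    }


def to_kinesis_record(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Build one entry for the Records list of a Kinesis put_records call.

    Partitioning by document id keeps all changes to one document on the
    same shard, so they are read back in order.
    """
    return {
        "Data": json.dumps(envelope).encode("utf-8"),
        "PartitionKey": envelope["id"],
    }


def fold_batch(state: Dict[str, Optional[str]],
               envelopes: Iterable[Dict[str, Any]]) -> Dict[str, Optional[str]]:
    """Apply one micro-batch of envelopes to a current-state map keyed by id."""
    # Reduce to the latest event per key first. A MERGE needs at most one
    # source row per target key, so this dedup step matters in Spark too.
    latest: Dict[str, Dict[str, Any]] = {}
    for env in envelopes:
        prev = latest.get(env["id"])
        if prev is None or env["ts"] >= prev["ts"]:
            latest[env["id"]] = env

    # Then upsert or delete, which is what the MERGE clauses would do.
    for key, env in latest.items():
        if env["op"] == "delete":
            state.pop(key, None)
        else:
            state[key] = env["doc"]
    return state
```

This sketch only orders events within a batch. Handling an event that arrives late in a later batch would need the last applied `ts` stored per key, which AUTO CDC handles through its sequencing column.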

If you would rather drop the Lambda, AWS DMS supports DocumentDB as a source and can land CDC in Kinesis or MSK (then stream as above), or in S3, which you can read with Auto Loader as a micro-batch option.

Two things to plan for. First, enable change streams and watch their retention window: a consumer that falls behind past retention needs a snapshot backfill plus the stream. Second, pick your trigger by latency need: Trigger.AvailableNow for cheap incremental batches, or a continuous / short processingTime trigger for true near-real-time.
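
Both decisions can be written down as small helpers. This is a rough sketch under stated assumptions: the function names, the 15-minute safety margin, and the one-minute latency cutoff are illustrative choices, not recommended values. The returned dictionary is meant to be passed to PySpark's `DataStreamWriter.trigger(...)` as keyword arguments.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Union


def needs_snapshot_backfill(last_event_time: datetime,
                            now: datetime,
                            retention: timedelta,
                            safety_margin: timedelta = timedelta(minutes=15)) -> bool:
    """True if the consumer is at or near the end of change stream retention.

    last_event_time is the timestamp of the last change event the consumer
    processed. The safety margin is an assumption: it triggers a backfill
    slightly early instead of racing the retention cutoff.
    """
    lag = now - last_event_time
    return lag >= retention - safety_margin


def choose_trigger(max_latency: timedelta) -> Dict[str, Union[str, bool]]:
    """Pick keyword arguments for DataStreamWriter.trigger from a latency target."""
    # Assumed cutoff: at one minute or less, keep the stream running with a
    # short processingTime; otherwise run scheduled incremental batches.
    if max_latency <= timedelta(minutes=1):
        seconds = max(1, int(max_latency.total_seconds()))
        return {"processingTime": f"{seconds} seconds"}
    return {"availableNow": True}
```

A caller would use it roughly as `writer.trigger(**choose_trigger(timedelta(seconds=10)))`. With `availableNow`, end-to-end latency is set by how often the job is scheduled, not by the trigger itself.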

Streaming Amazon DocumentDB to Databricks in near real time - what's the best approach?
