Databricks Community

Rohansingh01 · Sunday

Sharing my hands-on experience with Lakeflow Connect for anyone evaluating it for database ingestion. I recently moved data from PostgreSQL on AWS RDS into Databricks, and it replaced a painful legacy pipeline. Keeping this simple and practical.

What is Lakeflow Connect?

Lakeflow Connect is Databricks' managed ingestion layer — built-in connectors that pull data from databases, SaaS apps, and file stores straight into the Lakehouse, with CDC, schema handling, and Unity Catalog governance managed for you.

Quick distinction that confuses people:

Lakeflow Connect → ingestion (gets data into Databricks). The front door.
Spark/Lakeflow Declarative Pipelines → transformation (shapes the data after it lands).
Structured Streaming → the low-level engine where you write and manage the code yourself (sources, sinks, checkpoints).

Connect sits in front of your transformation pipelines, and you don't hand-write the CDC code.

What problem does it solve?

It collapses the usual multi-tool ingestion chain (CDC tool + message bus + migration service + S3 + custom merge code) into one managed connector. You get reliable inserts/updates/deletes, schema-drift handling, governance via Unity Catalog, and a much smaller cost footprint.

Available sources (mid-2026):

Database CDC connectors: PostgreSQL, MySQL, SQL Server, Oracle (SQL Server GA; Postgres in Public Preview).
SaaS connectors such as Salesforce, Workday, ServiceNow, Zendesk, HubSpot, Jira.
File connectors: SharePoint, Google Drive.
Streaming connectors: RabbitMQ, Kafka.
Query-based/federation sources.

Always check the current release state in the docs.

The legacy system

Postgres → Debezium → Kafka → DMS → S3 → (legacy structure code) → Delta tables

Problems we kept hitting:

Deletes/updates not handled properly — tables drifted into an inconsistent state.
Data sync issues with the Postgres source — frequent manual reconciliation.
DMS getting choked under heavy load → pipeline failed. Recovery meant manually deleting checkpoints, re-running, and reloading data from scratch.
Cost stacking — paying for VMs + Databricks + Kafka + DMS all at once.
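
To make the first problem concrete: the merge code in a hand-built pipeline has to apply inserts, updates, and deletes per primary key in commit order, and skip duplicate deliveries. Here is a minimal pure-Python sketch of that logic. The event shape (`op`, `key`, `lsn`, `row`) is hypothetical, and this is neither our legacy code nor how Lakeflow Connect works internally.

```python
from typing import Any, Dict, List


def apply_cdc(table: Dict[Any, dict], events: List[dict]) -> Dict[Any, dict]:
    """Apply change events to an in-memory table keyed by primary key.

    Hypothetical event shape:
    {"op": "insert" | "update" | "delete", "key": pk, "lsn": int, "row": dict or None}
    """
    last_lsn: Dict[Any, int] = {}  # highest commit position applied per key
    # Events can arrive out of order, so sort by commit position first.
    for ev in sorted(events, key=lambda e: e["lsn"]):
        key, lsn = ev["key"], ev["lsn"]
        if lsn <= last_lsn.get(key, -1):
            continue  # duplicate delivery of a change already applied
        if ev["op"] == "delete":
            table.pop(key, None)  # a delete must remove the row, not be dropped
        else:
            table[key] = dict(ev["row"])  # insert and update are both upserts
        last_lsn[key] = lsn
    return table
```

If deletes are silently ignored or events are applied in arrival order, the target drifts exactly the way ours did.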

The new system

Postgres → Databricks

One managed connector replaces the entire middle. Advantages I actually felt:

Cost — dropped Kafka, DMS, and supporting VMs entirely. Those were the budget drains.
Simplicity — one place to look instead of six.
Reliability — managed CDC handles deletes/updates; no more checkpoint surgery + full reloads.
Governance — lands in Unity Catalog with access control and lineage out of the box.

On pricing — correcting a common claim: I've seen "free up to 100 GB until June 30" floating around. Per the docs, what's actually true is each workspace gets 100 free DBUs/day (~100M records/day) before standard pricing, plus a 50% promotional discount until June 30, 2026. Not a flat "100 GB free." Verify the current numbers on the official pricing page before quoting them.
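
For a back-of-envelope estimate, the arithmetic can be sketched like this. The per-DBU rate is a placeholder, and the order of operations (subtract the free daily allowance first, then apply the discount to the remainder) is my assumption, not something I confirmed in the docs.

```python
def estimated_daily_cost(
    dbus_used: float,
    rate_per_dbu: float,            # placeholder: take the real rate from the pricing page
    free_dbus_per_day: float = 100.0,
    promo_discount: float = 0.5,    # 50% promotional discount
) -> float:
    """Rough daily cost for one workspace under the assumptions above."""
    billable_dbus = max(0.0, dbus_used - free_dbus_per_day)
    return billable_dbus * rate_per_dbu * (1.0 - promo_discount)
```

With a made-up rate of 0.50 per DBU, 300 DBUs in a day gives 200 billable DBUs and an estimate of 50.0, while anything at or under 100 DBUs comes out at 0.0.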

Implementation

5.1 Connection + pipeline — Register a Unity Catalog connection to the Postgres source (host, port, db, credentials via secret), then create an ingestion pipeline, select tables, map to a destination catalog/schema. Initial snapshot, then continuous CDC.
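
As a rough sketch of the table-mapping part when scripted instead of clicked, the function below assembles a pipeline spec as a plain Python dict. The field names (`ingestion_definition`, `objects`, `source_schema`, `destination_catalog`, and so on) follow my understanding of the Pipelines API payload for managed ingestion. Treat them as assumptions and check the current API reference, especially while the Postgres connector is in preview. The fields that point at the connection are left out on purpose because they vary by connector type. All names are placeholders.

```python
from typing import List


def build_ingestion_spec(
    name: str,
    source_db: str,
    source_schema: str,
    tables: List[str],
    dest_catalog: str,
    dest_schema: str,
) -> dict:
    """Build a pipeline spec that maps source tables to a destination schema."""
    objects = [
        {
            "table": {
                "source_catalog": source_db,  # assumption: the Postgres database name
                "source_schema": source_schema,
                "source_table": table,
                "destination_catalog": dest_catalog,
                "destination_schema": dest_schema,
            }
        }
        for table in tables
    ]
    # Connection/gateway reference fields are intentionally omitted here.
    return {"name": name, "ingestion_definition": {"objects": objects}}


spec = build_ingestion_spec(
    name="pg_orders_ingest",       # placeholder names throughout
    source_db="appdb",
    source_schema="public",
    tables=["orders", "customers"],
    dest_catalog="main",
    dest_schema="bronze_pg",
)
```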

5.2 Compute — Through the UI you can only configure serverless. To pick a specific classic compute node type, use the API. In my case I selected an R-series instance, r5d.2xlarge with 3 workers (memory-optimized, good fit for the merge workload). Tip: validate on serverless first, then move to API-defined compute for tuning.
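
The compute part of the API payload can be sketched as below. `clusters`, `label`, `node_type_id`, `num_workers`, and `serverless` are pipeline settings fields as I understand them. Which pipeline in your setup takes the classic cluster is something to confirm in the connector docs, so the helper only shows the shape.

```python
import copy


def with_classic_compute(
    spec: dict,
    node_type_id: str = "r5d.2xlarge",  # the instance type I used
    num_workers: int = 3,
) -> dict:
    """Return a copy of a pipeline spec that pins a classic cluster."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    updated = copy.deepcopy(spec)  # leave the caller's spec untouched
    updated["serverless"] = False
    updated["clusters"] = [
        {
            "label": "default",
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        }
    ]
    return updated
```

Keeping the serverless spec as the base and deriving the classic variant from it matches the tip above: validate on serverless, then switch only the compute block.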

5.3 SCD Type 2 — For certain tables I enabled SCD Type 2 so every version of a row is preserved with validity markers — full history for auditing and point-in-time analysis. Other tables stay on latest-state only.
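
If SCD Type 2 is new to you, here is what "every version preserved with validity markers" means, as a small pure-Python sketch. The column names `valid_from` and `valid_to` are illustrative, not the names the connector writes, and the connector does all of this for you.

```python
from typing import Any, List, Optional


def apply_scd2(history: List[dict], key: Any, row: Optional[dict], ts: int) -> List[dict]:
    """Record a new version of `key` at time `ts`; row=None means the key was deleted."""
    for rec in history:
        if rec["key"] == key and rec["valid_to"] is None:
            if row is not None and rec["row"] == row:
                return history  # nothing changed, keep the open version
            rec["valid_to"] = ts  # close the currently open version
    if row is not None:
        history.append({"key": key, "row": dict(row), "valid_from": ts, "valid_to": None})
    return history


def as_of(history: List[dict], key: Any, ts: int) -> Optional[dict]:
    """Point-in-time lookup: the version of `key` that was valid at `ts`."""
    for rec in history:
        still_open = rec["valid_to"] is None
        if rec["key"] == key and rec["valid_from"] <= ts and (still_open or ts < rec["valid_to"]):
            return rec["row"]
    return None
```

A latest-state table (SCD Type 1) would simply overwrite the row, so the `as_of` question could not be answered.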

5.4 Automation — Wrapped everything in Declarative Automation Bundles (DABs) — the new name for Databricks Asset Bundles (renamed March 2026; same databricks.yml workflow and databricks bundle CLI). Connection, pipeline, compute config, and table mappings as YAML in Git. Gives reusability, CI/CD deployment, and version control instead of manual UI clicks.
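
For the CI/CD step, a thin wrapper around the bundle CLI is usually enough. This sketch builds the `databricks bundle validate` and `databricks bundle deploy` commands for a target. The target name `dev` is a placeholder for whatever you define in databricks.yml, and `dry_run` defaults to True so nothing is executed unless you ask for it.

```python
import subprocess
from typing import List


def bundle_commands(target: str) -> List[List[str]]:
    """Commands to validate and then deploy the bundle for one target."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]


def deploy(target: str, dry_run: bool = True) -> List[str]:
    """Run the commands in order, or only list them when dry_run is True."""
    attempted = []
    for cmd in bundle_commands(target):
        attempted.append(" ".join(cmd))
        if not dry_run:
            # check=True raises on a non-zero exit, so a failed validate stops the deploy
            subprocess.run(cmd, check=True)
    return attempted


planned = deploy("dev")  # "dev" is a hypothetical target name
```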

Takeaway

Going from Postgres → Debezium → Kafka → DMS → S3 → legacy code → Databricks to just Postgres → Databricks removed entire failure categories, cut the cost stack, and gave me proper CDC and governance natively. If you maintain a hand-built CDC pipeline into Databricks, it's worth piloting one table on serverless, confirming that deletes and updates behave, then scaling with API-defined compute and Declarative Automation Bundles.
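
For the "confirm deletes/updates behave" step in a pilot, a simple reconciliation check is enough. This is a sketch under a few assumptions: both sides are pulled as lists of dicts projected to the same columns, the primary key column is called `id`, and the comparison runs at a quiet moment so in-flight changes don't show up as false differences.

```python
import hashlib
import json
from typing import Dict, List


def row_hash(row: dict) -> str:
    """Stable hash of a row, independent of column order."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(source_rows: List[dict], target_rows: List[dict], key: str = "id") -> Dict[str, list]:
    """Compare source and target by primary key and row content."""
    src = {r[key]: row_hash(r) for r in source_rows}
    tgt = {r[key]: row_hash(r) for r in target_rows}
    shared = src.keys() & tgt.keys()
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),      # inserts not applied
        "not_deleted_in_target": sorted(tgt.keys() - src.keys()),  # deletes not applied
        "stale_in_target": sorted(k for k in shared if src[k] != tgt[k]),  # updates not applied
    }
```

Three empty lists after a round of test inserts, updates, and deletes on the source is the signal I'd want before adding more tables.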