<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic My experience replacing a Postgres → Kafka → DMS → S3 pipeline with Lakeflow Connect in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/my-experience-replacing-a-postgres-kafka-dms-s3-pipeline-with/m-p/158528#M1263</link>
    <description>&lt;P class=""&gt;Sharing my hands-on experience with &lt;STRONG&gt;Lakeflow Connect&lt;/STRONG&gt; for anyone evaluating it for database ingestion. I recently moved data from &lt;STRONG&gt;PostgreSQL on AWS RDS&lt;/STRONG&gt; into Databricks, and it replaced a painful legacy pipeline. Keeping this simple and practical.&lt;/P&gt;&lt;H2&gt;What is Lakeflow Connect?&lt;/H2&gt;&lt;P class=""&gt;Lakeflow Connect is Databricks' &lt;STRONG&gt;managed ingestion&lt;/STRONG&gt; layer — built-in connectors that pull data from databases, SaaS apps, and file stores straight into the Lakehouse, with CDC, schema handling, and Unity Catalog governance managed for you.&lt;/P&gt;&lt;P class=""&gt;Quick distinction that confuses people:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Lakeflow Connect&lt;/STRONG&gt; → &lt;STRONG&gt;ingestion&lt;/STRONG&gt; (gets data &lt;EM&gt;into&lt;/EM&gt; Databricks). The front door.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Spark/Lakeflow Declarative Pipelines&lt;/STRONG&gt; → &lt;STRONG&gt;transformation&lt;/STRONG&gt; (shapes the data after it lands).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Structured Streaming&lt;/STRONG&gt; → the low-level engine where you write and manage the code yourself (sources, sinks, checkpoints).&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Connect sits &lt;EM&gt;before&lt;/EM&gt; your transformation pipelines and you don't hand-write the CDC code.&lt;/P&gt;&lt;H2&gt;What problem does it solve?&lt;/H2&gt;&lt;P class=""&gt;It collapses the usual multi-tool ingestion chain (CDC tool + message bus + migration service + S3 + custom merge code) into one managed connector. You get reliable inserts/updates/&lt;STRONG&gt;deletes&lt;/STRONG&gt;, schema-drift handling, governance via Unity Catalog, and a much smaller cost footprint.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Available sources (mid-2026):&lt;/STRONG&gt; Database CDC connectors for PostgreSQL, MySQL, SQL Server, Oracle (SQL Server GA; Postgres in Public Preview). 
SaaS connectors such as Salesforce, Workday, ServiceNow, Zendesk, HubSpot, and Jira. File connectors (SharePoint, Google Drive). Streaming connectors (RabbitMQ, Kafka). Plus query-based/federation sources. Always check each connector's current release status in the docs.&lt;/P&gt;&lt;H2&gt;The legacy system&lt;/H2&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;Postgres → Debezium → Kafka → DMS → S3 → (legacy structure code) → Delta tables&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Problems we kept hitting:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Deletes/updates not handled properly&lt;/STRONG&gt;: tables drifted into an inconsistent state.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data sync issues&lt;/STRONG&gt; with the Postgres source: frequent manual reconciliation.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DMS choking&lt;/STRONG&gt; under heavy load, which failed the pipeline. Recovery meant manually deleting checkpoints, re-running, and reloading data from scratch.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cost stacking&lt;/STRONG&gt;: paying for VMs, Databricks, Kafka, and DMS all at once.&lt;/LI&gt;&lt;/OL&gt;&lt;H2&gt;The new system&lt;/H2&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;Postgres → Databricks&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;One managed connector replaces the entire middle. Advantages I actually felt:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Cost&lt;/STRONG&gt;: dropped Kafka, DMS, and supporting VMs entirely.
Those were the budget drains.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Simplicity&lt;/STRONG&gt;: one place to look instead of six.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reliability&lt;/STRONG&gt;: managed CDC handles deletes/updates; no more checkpoint surgery and full reloads.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Governance&lt;/STRONG&gt;: data lands in Unity Catalog with access control and lineage out of the box.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;On pricing, correcting a common claim:&lt;/STRONG&gt; I've seen "free up to 100 GB until June 30" floating around. Per the docs, each workspace actually gets &lt;STRONG&gt;100 free DBUs/day&lt;/STRONG&gt; (~100M records/day) before standard pricing applies, plus a &lt;STRONG&gt;50% promotional discount until June 30, 2026&lt;/STRONG&gt;. That is not a flat "100 GB free." Verify the current numbers on the official pricing page before quoting them.&lt;/P&gt;&lt;H2&gt;Implementation&lt;/H2&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.1 Connection + pipeline&lt;/STRONG&gt;: Register a Unity Catalog connection to the Postgres source (host, port, database, credentials via a secret), then create an ingestion pipeline, select tables, and map them to a destination catalog/schema. The pipeline takes an initial snapshot, then runs continuous CDC.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.2 Compute&lt;/STRONG&gt;: Through the &lt;STRONG&gt;UI you can only configure serverless&lt;/STRONG&gt;. To pick a specific classic compute node type, use the &lt;STRONG&gt;API&lt;/STRONG&gt;. In my case I selected an &lt;STRONG&gt;R-series instance, r5d.2xlarge with 3 workers&lt;/STRONG&gt; (memory-optimized, a good fit for the merge workload). Tip: validate on serverless first, then move to API-defined compute for tuning.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.3 SCD Type 2&lt;/STRONG&gt;: For certain tables I enabled &lt;STRONG&gt;SCD Type 2&lt;/STRONG&gt; so every version of a row is preserved with validity markers, giving full history for auditing and point-in-time analysis.
Other tables keep only the latest state.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.4 Automation&lt;/STRONG&gt;: I wrapped everything in &lt;STRONG&gt;Declarative Automation Bundles (DABs)&lt;/STRONG&gt;, the new name for Databricks Asset Bundles (renamed March 2026; same databricks.yml workflow and databricks bundle CLI). The connection, pipeline, compute config, and table mappings live as YAML in Git, which gives reusability, CI/CD deployment, and version control instead of manual UI clicks.&lt;/P&gt;&lt;H2&gt;Takeaway&lt;/H2&gt;&lt;P class=""&gt;Going from Postgres → Debezium → Kafka → DMS → S3 → legacy code → Databricks to just Postgres → Databricks removed entire failure categories, cut the cost stack, and gave me proper CDC and governance natively. If you maintain a hand-built CDC pipeline into Databricks, it's worth piloting one table on serverless, confirming deletes/updates behave, then scaling with API compute and Declarative Automation Bundles.&lt;/P&gt;</description>
    <pubDate>Mon, 08 Jun 2026 02:08:09 GMT</pubDate>
    <dc:creator>Rohansingh01</dc:creator>
    <dc:date>2026-06-08T02:08:09Z</dc:date>
    <item>
      <title>My experience replacing a Postgres → Kafka → DMS → S3 pipeline with Lakeflow Connect</title>
      <link>https://community.databricks.com/t5/community-articles/my-experience-replacing-a-postgres-kafka-dms-s3-pipeline-with/m-p/158528#M1263</link>
      <description>&lt;P class=""&gt;Sharing my hands-on experience with &lt;STRONG&gt;Lakeflow Connect&lt;/STRONG&gt; for anyone evaluating it for database ingestion. I recently moved data from &lt;STRONG&gt;PostgreSQL on AWS RDS&lt;/STRONG&gt; into Databricks, and it replaced a painful legacy pipeline. Keeping this simple and practical.&lt;/P&gt;&lt;H2&gt;What is Lakeflow Connect?&lt;/H2&gt;&lt;P class=""&gt;Lakeflow Connect is Databricks' &lt;STRONG&gt;managed ingestion&lt;/STRONG&gt; layer — built-in connectors that pull data from databases, SaaS apps, and file stores straight into the Lakehouse, with CDC, schema handling, and Unity Catalog governance managed for you.&lt;/P&gt;&lt;P class=""&gt;Quick distinction that confuses people:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Lakeflow Connect&lt;/STRONG&gt; → &lt;STRONG&gt;ingestion&lt;/STRONG&gt; (gets data &lt;EM&gt;into&lt;/EM&gt; Databricks). The front door.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Spark/Lakeflow Declarative Pipelines&lt;/STRONG&gt; → &lt;STRONG&gt;transformation&lt;/STRONG&gt; (shapes the data after it lands).&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Structured Streaming&lt;/STRONG&gt; → the low-level engine where you write and manage the code yourself (sources, sinks, checkpoints).&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;Connect sits &lt;EM&gt;before&lt;/EM&gt; your transformation pipelines and you don't hand-write the CDC code.&lt;/P&gt;&lt;H2&gt;What problem does it solve?&lt;/H2&gt;&lt;P class=""&gt;It collapses the usual multi-tool ingestion chain (CDC tool + message bus + migration service + S3 + custom merge code) into one managed connector. 
You get reliable inserts/updates/&lt;STRONG&gt;deletes&lt;/STRONG&gt;, schema-drift handling, governance via Unity Catalog, and a much smaller cost footprint.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Available sources (mid-2026):&lt;/STRONG&gt; Database CDC connectors for PostgreSQL, MySQL, SQL Server, and Oracle (SQL Server GA; Postgres in Public Preview). SaaS connectors such as Salesforce, Workday, ServiceNow, Zendesk, HubSpot, and Jira. File connectors (SharePoint, Google Drive). Streaming connectors (RabbitMQ, Kafka). Plus query-based/federation sources. Always check each connector's current release status in the docs.&lt;/P&gt;&lt;H2&gt;The legacy system&lt;/H2&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;Postgres → Debezium → Kafka → DMS → S3 → (legacy structure code) → Delta tables&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Problems we kept hitting:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Deletes/updates not handled properly&lt;/STRONG&gt;: tables drifted into an inconsistent state.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data sync issues&lt;/STRONG&gt; with the Postgres source: frequent manual reconciliation.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DMS choking&lt;/STRONG&gt; under heavy load, which failed the pipeline.
Recovery meant manually deleting checkpoints, re-running, and reloading data from scratch.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Cost stacking&lt;/STRONG&gt;: paying for VMs, Databricks, Kafka, and DMS all at once.&lt;/LI&gt;&lt;/OL&gt;&lt;H2&gt;The new system&lt;/H2&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;PRE&gt;&lt;SPAN&gt;Postgres → Databricks&lt;/SPAN&gt;&lt;/PRE&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;One managed connector replaces the entire middle. Advantages I actually felt:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;Cost&lt;/STRONG&gt;: dropped Kafka, DMS, and supporting VMs entirely. Those were the budget drains.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Simplicity&lt;/STRONG&gt;: one place to look instead of six.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Reliability&lt;/STRONG&gt;: managed CDC handles deletes/updates; no more checkpoint surgery and full reloads.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Governance&lt;/STRONG&gt;: data lands in Unity Catalog with access control and lineage out of the box.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;On pricing, correcting a common claim:&lt;/STRONG&gt; I've seen "free up to 100 GB until June 30" floating around. Per the docs, each workspace actually gets &lt;STRONG&gt;100 free DBUs/day&lt;/STRONG&gt; (~100M records/day) before standard pricing applies, plus a &lt;STRONG&gt;50% promotional discount until June 30, 2026&lt;/STRONG&gt;. That is not a flat "100 GB free."
Verify the current numbers on the official pricing page before quoting them.&lt;/P&gt;&lt;H2&gt;Implementation&lt;/H2&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.1 Connection + pipeline&lt;/STRONG&gt;: Register a Unity Catalog connection to the Postgres source (host, port, database, credentials via a secret), then create an ingestion pipeline, select tables, and map them to a destination catalog/schema. The pipeline takes an initial snapshot, then runs continuous CDC.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.2 Compute&lt;/STRONG&gt;: Through the &lt;STRONG&gt;UI you can only configure serverless&lt;/STRONG&gt;. To pick a specific classic compute node type, use the &lt;STRONG&gt;API&lt;/STRONG&gt;. In my case I selected an &lt;STRONG&gt;R-series instance, r5d.2xlarge with 3 workers&lt;/STRONG&gt; (memory-optimized, a good fit for the merge workload). Tip: validate on serverless first, then move to API-defined compute for tuning.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.3 SCD Type 2&lt;/STRONG&gt;: For certain tables I enabled &lt;STRONG&gt;SCD Type 2&lt;/STRONG&gt; so every version of a row is preserved with validity markers, giving full history for auditing and point-in-time analysis. Other tables keep only the latest state.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;5.4 Automation&lt;/STRONG&gt;: I wrapped everything in &lt;STRONG&gt;Declarative Automation Bundles (DABs)&lt;/STRONG&gt;, the new name for Databricks Asset Bundles (renamed March 2026; same databricks.yml workflow and databricks bundle CLI). The connection, pipeline, compute config, and table mappings live as YAML in Git, which gives reusability, CI/CD deployment, and version control instead of manual UI clicks.&lt;/P&gt;&lt;H2&gt;Takeaway&lt;/H2&gt;&lt;P class=""&gt;Going from Postgres → Debezium → Kafka → DMS → S3 → legacy code → Databricks to just Postgres → Databricks removed entire failure categories, cut the cost stack, and gave me proper CDC and governance natively.
If you maintain a hand-built CDC pipeline into Databricks, it's worth piloting one table on serverless, confirming deletes/updates behave, then scaling with API compute and Declarative Automation Bundles.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jun 2026 02:08:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/my-experience-replacing-a-postgres-kafka-dms-s3-pipeline-with/m-p/158528#M1263</guid>
      <dc:creator>Rohansingh01</dc:creator>
      <dc:date>2026-06-08T02:08:09Z</dc:date>
    </item>
    <item>
      <title>Re: My experience replacing a Postgres → Kafka → DMS → S3 pipeline with Lakeflow Connect</title>
      <link>https://community.databricks.com/t5/community-articles/my-experience-replacing-a-postgres-kafka-dms-s3-pipeline-with/m-p/158579#M1264</link>
      <description>&lt;P&gt;Great article!&lt;/P&gt;</description>
      <pubDate>Mon, 08 Jun 2026 21:01:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/my-experience-replacing-a-postgres-kafka-dms-s3-pipeline-with/m-p/158579#M1264</guid>
      <dc:creator>rdokala</dc:creator>
      <dc:date>2026-06-08T21:01:24Z</dc:date>
    </item>
  </channel>
</rss>

