
ShivamKumar7788 · 3 weeks ago

Databricks CustomerLake: Inside the Agentic CDP Built for the Age of AI

A deep dive into what CustomerLake actually is, how it works, and what it looks like in practice.

At Data + AI Summit 2026, Databricks announced CustomerLake — an Agentic Customer Data Platform built natively inside the Databricks Lakehouse.

Not a standalone tool. Not a separate layer bolted on top. The thinking behind it is simple: rather than pulling customer data out into yet another platform, bring the CDP capabilities directly into the environment where that data already lives — with governance, security, and data infrastructure already in place.

This post walks through what CustomerLake covers, how it works, and what the product actually looks like — drawing from the official keynote, product demo, and launch materials.

The Problem CustomerLake Is Solving

Marketing at most enterprises still follows a familiar sequence. A plan gets defined. Data teams pull together what is needed. Audiences are assembled. A campaign gets configured in some automation tool, pushed out, and then measured. Rinse and repeat.

The cycle has worked well enough, but it moves slowly. Building and refining a campaign typically takes weeks, sometimes months. And the output, despite all the effort, tends to be broad: the same message goes out to large groups rather than being tailored to individual customers.

Meanwhile, the buying side is evolving fast. AI agents are now doing research, comparing options, and making purchases on behalf of consumers — always available, reacting to new context almost immediately, operating across a growing number of channels. Marketing built around weekly batch cycles does not keep pace with that.

The Concept: Infinity Campaigns

CustomerLake is built around a core idea Databricks calls Infinity Campaigns.

Traditional campaigns are time-boxed: they start, run against a predefined audience, and end. Infinity Campaigns work differently. They are continuous engagement loops, always running, with no fixed end state. Every customer gets evaluated individually in real time, which means the one-to-many model gives way to something closer to true one-to-one engagement.

The underlying logic: customer actions and signals get picked up by enterprise-side agents, processed against the customer's full profile and context, and a decision gets made — does this person need to hear from us right now, and if so, what and through which channel? When an action is taken, that interaction becomes a new signal, feeding back into the same loop.

Evergreen. Always adapting. No campaign relaunch required.
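To make the loop concrete, here is a minimal Python sketch of the signal, decision, action, feedback cycle described above. It is purely illustrative and not CustomerLake code: `Signal`, `decide`, and `run_loop` are hypothetical names, and the decision rule is a toy stand-in for whatever the agents actually evaluate.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    customer_id: str
    kind: str  # e.g. "cart_abandoned", "message_sent"


def decide(signal: Signal, profile: dict) -> Optional[str]:
    """Toy rule (an assumption): return a channel to use, or None to stay silent."""
    if signal.kind == "cart_abandoned" and profile.get("email_opt_in"):
        return "email"
    return None


def run_loop(initial: list[Signal], profiles: dict[str, dict],
             max_steps: int = 10) -> list[str]:
    """Process signals; every action taken is fed back in as a new signal."""
    queue = deque(initial)
    log = []
    steps = 0
    while queue and steps < max_steps:
        signal = queue.popleft()
        steps += 1
        channel = decide(signal, profiles.get(signal.customer_id, {}))
        if channel is not None:
            log.append(f"{signal.customer_id}:{channel}")
            # The interaction itself becomes a signal for the next pass.
            queue.append(Signal(signal.customer_id, "message_sent"))
    return log
```

The `max_steps` bound exists only so the sketch terminates; a real always-on loop would consume a stream instead of a finite queue.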

What Makes a CDP "Agentic"

CDPs have historically served three core functions: building a unified customer profile, enabling marketers to define audience segments, and pushing those segments out to execution tools like email or mobile platforms.

The limitation has always been architectural. CDPs lived outside the core data platform. They needed their own copy of the customer data, maintained their own governance layer, and required ongoing data movement to stay current.

For an agentic approach to work, that architecture breaks down. Agents need access to everything — customer history, behavioral context, business rules, predictive models, campaign performance — all in one place, without data movement introducing lag or gaps.

Databricks built CustomerLake around three requirements for what an agentic CDP needs to be:

Embedded in the lakehouse — customer data, context, and agents share the same infrastructure. No copies, no sync jobs, no reconciliation between systems.

Built around agents as the core operating model — not a conventional platform with an AI feature added. The agent is how data gets prepared, how audiences get shaped, how campaigns get planned, and how decisions get made per customer.

Capable of true one-to-one personalization at scale — not segments of thousands, but individual decisions made continuously for every customer in the system.

The Architecture

CustomerLake has two main components: Profile Agents and Campaign Agents.

Raw customer data flows in, gets processed by Profile Agents into clean unified profiles, and those profiles become the foundation for Campaign Agents to run Infinity Campaigns. A built-in Reverse ETL layer handles pushing decisions and audience data back out to the execution tools that reach customers — email platforms, ad networks, SMS, in-app messaging, and more.

Data sources include anything already sitting in the Databricks Lakehouse, plus external data from MarTech and CRM systems brought in through Lakeflow Connect. Unity Catalog handles governance across the whole stack — the same controls that apply to the rest of the data estate apply here too.

Profile Agents: Building the Customer 360

Getting to a reliable, unified customer profile is foundational to everything else. Profile Agents handle the full pipeline to get there.

Data Preparation

Lakeflow Connect brings in external data — from CRM platforms, MarTech tools, and third-party sources — alongside whatever is already in the lakehouse. Once a new dataset lands, Genie (Databricks' AI layer) takes over the preparation work.

It reads the dataset, identifies what each column actually represents — email, phone number, full name, address — and applies semantic tags accordingly. It then generates normalization rules to clean and standardize the data automatically, handling inconsistencies and filtering out invalid values without any manual mapping.
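As a rough picture of what tagging and normalization involve, the sketch below uses simple regex heuristics in Python. This is an assumption-heavy stand-in: the launch materials do not describe how Genie infers column meaning, and `tag_column` and `normalize` are hypothetical helpers that only cover email and phone columns.

```python
import re
from typing import Optional

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def tag_column(values: list[str]) -> str:
    """Guess a semantic tag from sample values (heuristic stand-in, not Genie's method)."""
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return "unknown"
    if sum(bool(EMAIL_RE.match(v)) for v in sample) / len(sample) >= 0.8:
        return "email"
    digits = [re.sub(r"\D", "", v) for v in sample]
    if sum(7 <= len(d) <= 15 for d in digits) / len(sample) >= 0.8:
        return "phone"
    return "unknown"


def normalize(value: str, tag: str) -> Optional[str]:
    """Standardize a value for its tag; return None for values that look invalid."""
    value = value.strip()
    if tag == "email":
        value = value.lower()
        return value if EMAIL_RE.match(value) else None
    if tag == "phone":
        digits = re.sub(r"\D", "", value)
        return digits if 7 <= len(digits) <= 15 else None
    return value or None
```

A digit-count heuristic like this would also mistake other numeric columns, such as dates, for phone numbers. That weakness is exactly where a model that reads column names and context has an advantage over hand-written rules.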

Third-party data enrichment is accessible through a Data & Identity Marketplace — providers can be connected and their data pulled in with a single click.

Identity Resolution

Matching records across different data sources — recognizing that two entries with slightly different details actually represent the same person — has always been one of the harder problems in customer data work.

CustomerLake handles this through what Databricks calls Agentic Identity Resolution, which runs across three stages:

Rules-based matching covers the straightforward cases — exact matches on unique IDs, normalized email addresses, or combinations like phone number with a fuzzy name match. The rules are readable and configurable.

LLM review handles the middle ground — cases where the rules do not reach a confident conclusion. A language model steps in to assess whether two profiles are likely the same person.

Human review is reserved for the genuinely uncertain — a queue where a person makes the final determination.

What ties this together is a feedback loop. Decisions made at the LLM and human stages are incorporated back into the rules layer, so the process is designed to become more accurate with each run. Organizations can also bring their own ML models into the pipeline if they already have them.
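The three-stage routing can be sketched as follows. Everything here is hypothetical: the thresholds, the `llm_score` callback (standing in for the language model review), and the function names are assumptions for illustration, and the feedback step that turns reviewed decisions into new rules is left out.

```python
from difflib import SequenceMatcher
from typing import Callable


def name_similarity(a: str, b: str) -> float:
    """Fuzzy name similarity in [0, 1] using the standard library."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def rules_stage(a: dict, b: dict) -> str:
    """Return 'match', 'no_match', or 'unsure' from simple deterministic rules."""
    if a.get("email") and a.get("email") == b.get("email"):
        return "match"
    same_phone = bool(a.get("phone")) and a.get("phone") == b.get("phone")
    sim = name_similarity(a.get("name", ""), b.get("name", ""))
    if same_phone and sim >= 0.85:
        return "match"
    if not same_phone and sim < 0.5:
        return "no_match"
    return "unsure"


def resolve(a: dict, b: dict, llm_score: Callable[[dict, dict], float],
            human_queue: list) -> str:
    """Route a candidate pair through rules, then a model score, then human review."""
    verdict = rules_stage(a, b)
    if verdict != "unsure":
        return verdict
    score = llm_score(a, b)  # assumed to return a probability in [0, 1]
    if score >= 0.9:
        return "match"
    if score <= 0.1:
        return "no_match"
    human_queue.append((a, b))
    return "needs_human_review"
```

The point of the ordering is cost: cheap rules settle most pairs, the model is called only for the remainder, and people see only what neither stage could settle.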

When a new data source is added, Genie automatically analyzes it against existing matching rules and recommends additional rules where gaps or opportunities are identified — explaining the reasoning behind each suggestion and previewing the expected impact before anything is applied.

Gold Customer Table

The end product of Profile Agents is a Gold Customer Table — a single governed schema that every data source maps into. Where sources disagree on a field value, survivorship rules decide which one wins. The whole thing is configurable through a UI or YAML, so both technical and non-technical team members can work with it.
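Survivorship is easiest to understand with a small example. The Python sketch below assumes two common strategies, source priority and most recent value. The `RULES` layout and `pick_survivor` are hypothetical and do not reflect CustomerLake's actual YAML schema, which the launch materials do not spell out.

```python
from datetime import date

# Hypothetical per-field rules, similar in spirit to what a YAML config might express.
RULES = {
    "email": {"strategy": "source_priority", "order": ["crm", "web", "pos"]},
    "address": {"strategy": "most_recent"},
}


def pick_survivor(field: str, candidates: list[dict]) -> object:
    """Choose the winning value for one field among candidate records.

    Each candidate looks like {"source": str, "value": ..., "updated": date}.
    Empty values never win; fields without a rule fall back to most recent.
    """
    live = [c for c in candidates if c["value"] not in (None, "")]
    if not live:
        return None
    rule = RULES.get(field, {"strategy": "most_recent"})
    if rule["strategy"] == "source_priority":
        order = rule["order"]

        def rank(c: dict) -> int:
            # Sources not listed in the rule sort after all listed ones.
            return order.index(c["source"]) if c["source"] in order else len(order)

        return min(live, key=rank)["value"]
    return max(live, key=lambda c: c["updated"])["value"]
```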

Campaign Agents: From Goal to Individual Decision

With a clean, unified customer profile in place, Campaign Agents take over — translating business goals into personalized, continuously running campaigns.

Building Audiences

Audience creation works through natural language. A marketer describes the audience they need, and Genie builds the segment directly against live lakehouse data. No SQL. No hand-off to a data analyst. Existing audiences can be refined the same way, by describing the additional conditions needed. Genie converts the description into precise data filters and updates the segment instantly.

Audience insights — size trend over time, purchase category breakdown, average spend, churn risk — are surfaced automatically. Suppression rules reference live data conditions rather than point-in-time exports, so someone who converts mid-campaign is removed from eligibility immediately, not at the next scheduled refresh.
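The difference between live suppression and a point-in-time export comes down to when the condition is evaluated. A minimal Python sketch, where `eligible_now` and the `fetch_state` lookup are hypothetical names:

```python
from typing import Callable


def eligible_now(customer_id: str, audience: set[str],
                 fetch_state: Callable[[str], dict]) -> bool:
    """Check eligibility at send time against current state, not a stored export."""
    if customer_id not in audience:
        return False
    state = fetch_state(customer_id)  # hypothetical lookup against live data
    return not state.get("converted", False)
```

Because the check runs at the moment of each send, a conversion recorded a minute earlier already excludes the customer. With an exported list, the same customer would stay eligible until the next refresh.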

Campaign Planning

Turning an audience and a goal into a campaign starts with a brief conversation. Genie asks a focused set of questions — which channels to use, how many messages to send, when the campaign should conclude — and uses the answers to generate a structured campaign brief.

The brief covers the full picture: goals and success metrics, a sequenced messaging plan with rationale per touchpoint, timing and cadence, guardrails (frequency limits, opt-out lists, suppression of customers with open support tickets), personalization signals to draw on, and the assumptions behind the plan.

This document becomes the foundation the campaign is built from. It is fully editable before anything gets built.
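One way to picture the brief is as a structured, editable object. The Python dataclasses below mirror the sections listed above. All field names and default values are assumptions made for illustration, not the product's actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Touchpoint:
    channel: str
    day_offset: int  # days after the customer enters the campaign
    rationale: str


@dataclass
class Guardrails:
    max_messages_per_week: int = 3
    suppress_open_support_tickets: bool = True
    honor_opt_outs: bool = True


@dataclass
class CampaignBrief:
    goal: str
    success_metric: str
    touchpoints: list[Touchpoint] = field(default_factory=list)
    guardrails: Guardrails = field(default_factory=Guardrails)
    personalization_signals: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means no issues were found."""
        problems = []
        if not self.touchpoints:
            problems.append("no touchpoints defined")
        offsets = [t.day_offset for t in self.touchpoints]
        if offsets != sorted(offsets):
            problems.append("touchpoints are not in chronological order")
        return problems
```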

Decisioning and Reasoning

Before going live, Campaign Agents can run a pre-launch simulation across a sample of real qualified profiles. The simulation shows what the agent would actually do for each person — which message they would receive, whether they would be deferred based on existing campaign load — without triggering any actual sends.

Each profile in the simulation comes with a Reasoning panel: a plain-language explanation of why that specific message was chosen, which rule it matched, and why the send timing was set the way it was. The agent also accounts for campaigns running in parallel — if a customer is already receiving heavy outreach from another active campaign, that factors into the decision before anything goes out.

This kind of per-profile transparency, available before launch rather than after a complaint, changes how marketers can review and trust the decisioning layer.
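A dry run with per-profile reasoning can be sketched in a few lines of Python. The rules, the weekly cap, and field names such as `messages_this_week` and `churn_risk` are invented for illustration; the real agent presumably weighs far more context.

```python
def simulate(profiles: list[dict], weekly_cap: int = 3) -> list[dict]:
    """Dry run: decide per profile and record the reasoning, without sending anything."""
    results = []
    for p in profiles:
        load = p.get("messages_this_week", 0)  # sends across all active campaigns
        if load >= weekly_cap:
            action = "defer"
            reason = f"already received {load} messages this week (cap {weekly_cap})"
        elif p.get("churn_risk", 0.0) >= 0.7:
            action = "send:retention_offer"
            reason = "churn risk at or above 0.7"
        else:
            action = "send:newsletter"
            reason = "no higher-priority rule matched"
        results.append({"customer_id": p["customer_id"], "action": action, "reason": reason})
    return results
```

Checking the cross-campaign load first means a heavily contacted customer is deferred regardless of which message would otherwise have matched.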

Performance and Activation

Once a campaign is live, Campaign Agents monitor it continuously — flagging performance trends and suggesting adjustments in real time. Native A/B testing makes variant comparison straightforward across the key engagement metrics.
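The launch materials do not say which statistics the native A/B testing relies on. As a reference point for what comparing two variants involves, here is a standard two-proportion z-test in Python; `two_proportion_z` is a hypothetical helper, not a product API.

```python
import math


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants converted at 0% or 100%: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z(100, 1000, 130, 1000)` compares a 10% conversion rate with a 13% one. A small p-value suggests the gap is unlikely to be noise, though repeatedly checking a live test inflates false positives unless the method accounts for it.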

Activation runs through Reverse ETL — bi-directional connections to the MarTech and AdTech tools already in use, covering email, SMS, in-app, and advertising platforms.

Early Customers and Partners

CustomerLake has been in private rollout with select enterprise customers ahead of the public announcement. Early customers include GM, AB InBev, HP, Circle K, Barclays, and Getnet.

The platform launches with an open partner ecosystem spanning identity, activation, measurement, and customer experience — alongside implementation partners supporting deployment at enterprise scale.

Where Things Stand

CustomerLake is currently available in Private Preview. Organizations interested in early access should reach out to their Databricks account team.

The product makes the most sense for teams whose data foundation is already on Databricks — the value comes from not having to replicate that foundation elsewhere to support marketing use cases. If the data is already there, the Customer 360, the audiences, and the campaign intelligence can be built on top of it directly, under the same governance that covers the rest of the data estate.

Sources:

Introducing CustomerLake: The Agentic CDP embedded in Databricks (Databricks Blog)

Introducing Databricks CustomerLake (Official YouTube)

Shivam Kumar
Senior Data Engineer