Databricks Community

Agre_Celebal · 6 hours ago

The OLTP architecture your agentic systems actually need, and how it compares to Supabase, Azure PostgreSQL, and Cosmos DB

Earlier this year, Nikita Shamgunov — the engineer leading Databricks Lakebase — published a number that reframed my entire architecture review: AI agents now create roughly 4x more databases than human developers.

Not 4x more queries. 4x more databases.

If you're building agentic AI systems on Databricks and still reaching for Supabase, Azure Database for PostgreSQL, or Cosmos DB as your OLTP layer — this article will challenge that decision. Not because those platforms are bad. They're not. But because they were designed for a world where humans write schemas, humans provision databases, and humans decide when something scales. Agents don't work that way. And the architecture that serves human-paced development quietly breaks under agentic workloads.

I learned this the hard way while building an internal Agentic Intelligence Platform at Celebal Technologies — three agent modules (Swarm Coordination, Ontology-Based Reasoning, and Causal Optimization) sharing a unified LLMOps spine on Databricks. I'll show you exactly what I got wrong in the database layer, what Lakebase changes, and how the alternatives stack up for teams building enterprise AI on Databricks.

The Problem: Agents Don't Use Databases the Way Humans Do

Traditional database architecture assumes a human-paced world. Applications write transactions. Dashboards read. ETL pipelines shuttle data between the OLTP and OLAP layers. The entire stack was designed around predictable access patterns and a well-understood divide between operational and analytical data.

Agents shatter all three of those assumptions simultaneously.

They're inherently ephemeral. A swarm agent coordinating a supply chain analysis spins up, decomposes a task across five specialist agents, writes hundreds of state checkpoints, and terminates — all in under thirty seconds. The next invocation may run on a completely different thread with zero shared context from the prior session. Legacy databases aren't built for disposable, bursty compute that needs to scale to zero between workloads and spin back up instantly for the next one.

They generate massive, high-frequency state churn. Every tool call, reasoning step, context retrieval, and handoff between agents is a potential checkpoint. For a multi-turn swarm agent handling a complex analytical task, that's hundreds of writes per session — each requiring exact-ID retrieval by thread_id or session_id, not vector similarity search. Postgres handles this natively. A Delta table, even a well-ZORDER'd one, adds overhead for an access pattern it was never designed to serve.

They need to reach analytical data without crossing a platform boundary. An agent recommending inventory adjustments needs to query the Gold Delta tables — the same tables your ML models trained on, governed by the same Unity Catalog policies your data engineering team enforces. If your OLTP layer lives outside Databricks, you're building a data copy pipeline just so your agent can read data that's already on the platform.

That third problem is where I went wrong.

The Mistake I Made Building the Agentic Platform

When I built the Swarm Coordination module of our Agentic Intelligence Platform, I used a Unity Catalog Delta table as the shared persistent memory store for multi-turn agent sessions. Delta was a reasonable first choice: it gave me time travel for session debugging, UC lineage on every agent write, and the ability to query session history in Spark SQL.

But Delta is an OLAP-optimized storage format. When the coordinator agent needed to retrieve the exact current state for a specific thread_id, it was running a scan-optimized query engine against a point-lookup workload. I added ZORDER on (session_id, turn_number) and tuned file sizes — which helped. But it was always the wrong tool for the access pattern.

What the architecture actually needed was a clean separation of concerns:

Short-term session state (checkpoints, thread context, current turn, handoff records) → a transactional store with exact-ID retrieval and sub-10ms read latency
Long-term episodic memory (past session summaries, cross-session reasoning patterns, performance analytics) → Delta Lake, where batch Spark SQL queries and Lakehouse Monitoring make sense

Lakebase is the transactional half of that equation. And it's the piece I didn't have.
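
As a rough illustration of the short-term half of that split, the store can be as small as one table keyed by thread and turn. The sketch below uses Python's built-in sqlite3 purely as a stand-in so it runs anywhere; against Lakebase you would send equivalent SQL through a Postgres driver (with %s placeholders instead of ?). The table name agent_checkpoints, its columns, and both helper functions are hypothetical and are not part of LangGraph or Lakebase.

```python
import json
import sqlite3

# Stand-in for a Postgres connection; sqlite3 is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE agent_checkpoints (
        thread_id   TEXT    NOT NULL,
        turn_number INTEGER NOT NULL,
        state       TEXT    NOT NULL,  -- JSON-encoded agent state
        PRIMARY KEY (thread_id, turn_number)
    )
    """
)


def save_checkpoint(thread_id: str, turn_number: int, state: dict) -> None:
    """Write (or overwrite) the checkpoint for one turn of one thread."""
    conn.execute(
        "INSERT INTO agent_checkpoints (thread_id, turn_number, state) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (thread_id, turn_number) DO UPDATE SET state = excluded.state",
        (thread_id, turn_number, json.dumps(state)),
    )
    conn.commit()


def latest_state(thread_id: str):
    """Exact-ID lookup: the primary-key index finds the thread without a table scan."""
    row = conn.execute(
        "SELECT state FROM agent_checkpoints "
        "WHERE thread_id = ? ORDER BY turn_number DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


save_checkpoint("thread-42", 1, {"routing_to": "forecast_agent"})
save_checkpoint("thread-42", 2, {"routing_to": "inventory_agent"})
print(latest_state("thread-42"))
```

The composite primary key is the design choice that matters here: it lets the lookup by thread_id use an index, which is the access pattern a scan-oriented engine struggles with.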

What Lakebase Provides for Agentic Systems

Lakebase is Databricks' fully managed, serverless PostgreSQL database, built on technology from Neon (the company Databricks acquired) and integrated natively into the Databricks platform. It reached General Availability in February 2026. Here are the capabilities that directly change the agent architecture:

Native LangGraph Checkpointing

Lakebase is a supported LangGraph checkpointer backend on both Databricks Apps and Model Serving endpoints. Authentication between your application and Lakebase is resolved automatically through the platform's Service Principal — no credential management in application code, no secret rotation for a separate database connection string.

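from databricks.sdk import WorkspaceClient
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.prebuilt import create_react_agent

# Databricks resolves authentication automatically via Service Principal
w = WorkspaceClient()
# Accessor shown as in the original post; verify the exact method name
# against the databricks-sdk version you have installed.
conn_str = w.lakebase.get_connection_string(instance_name="agent-state-prod")

# from_conn_string() is a context manager in langgraph-checkpoint-postgres
with PostgresSaver.from_conn_string(conn_str) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first use

    # The agent now has durable, OLTP-grade session state.
    # `model` and `tools` are your chat model and tool list, defined elsewhere.
    agent = create_react_agent(model, tools, checkpointer=checkpointer)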

This is the pattern you'd apply to the Swarm Coordination module. The coordinator's session state — which agent it's routing to, which specialist has already responded, the current confidence score — lives in Lakebase. The MLflow Trace of the full execution graph is separate (logged as a Databricks artifact). Two different concerns, two different stores, each doing what it does best.

Instant Database Branching for Agent Experimentation

This is the capability that directly addresses the "4x more databases" pattern. Lakebase supports copy-on-write branching: a full, isolated branch of a production-scale database in under one second, at near-zero initial storage cost (only diffs are written on change).

For agents, this changes what's possible:

A Causal Optimization agent running counterfactual "what-if" scenarios can branch the intervention state, explore the outcome, and discard the branch — without any risk to the production state
An agent autonomously testing schema migrations can branch, run the migration, validate, and either promote or roll back in a single API call
Development environments for agent workflows are ephemeral by default, provisioned and torn down programmatically

Databricks telemetry shows production Lakebase deployments averaging roughly 10 branches per database project, with some agent-driven workflows reaching hundreds of nested iterations. That pattern is structurally impossible with traditional managed Postgres where creating a copy requires duplicating the full storage filesystem.
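
To make the copy-on-write idea concrete, here is a toy in-memory model. It is not how Lakebase implements branching (real systems do this at the storage layer), and the Branch class and its methods are hypothetical names invented for this sketch. What it shows is why creating a branch costs nothing up front and why only the changed keys consume space.

```python
from typing import Optional


class Branch:
    """Toy copy-on-write branch: stores only keys written on this branch."""

    def __init__(self, parent: Optional["Branch"] = None):
        self.parent = parent
        self.diff = {}  # changes relative to the parent

    def branch(self) -> "Branch":
        """Creating a child copies no data, so it is O(1) whatever the data size."""
        return Branch(parent=self)

    def write(self, key, value) -> None:
        """Writes land on this branch only; the parent is never touched."""
        self.diff[key] = value

    def read(self, key):
        """Look in this branch first, then walk up the parent chain."""
        node = self
        while node is not None:
            if key in node.diff:
                return node.diff[key]
            node = node.parent
        raise KeyError(key)

    def promote(self) -> None:
        """Fold this branch's changes into its parent."""
        if self.parent is None:
            raise ValueError("root branch has no parent to promote into")
        self.parent.diff.update(self.diff)
        self.diff.clear()


prod = Branch()
prod.write("reorder_point", 100)

# A what-if scenario explores a change without touching production state.
what_if = prod.branch()
what_if.write("reorder_point", 140)
print(what_if.read("reorder_point"), prod.read("reorder_point"))
```

Discarding the scenario means dropping the what_if reference; keeping it means calling what_if.promote().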

Autoscaling with Scale-to-Zero

Agent workloads are bursty in a way that application workloads rarely are. Thousands of concurrent sessions during business hours, complete silence at 2am. Lakebase scales its compute up under load and down to zero between workloads — costs align with actual usage, not provisioned capacity. For multi-agent platforms running on Databricks Apps, this means the transactional backend matches the compute model of the application layer itself.
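
A back-of-the-envelope sketch of why this matters for cost. The function name and the figure of 10 active hours per day are hypothetical, chosen only for illustration, and the model ignores autoscaling between instance sizes and any wake-up time after idle.

```python
def billed_compute_hours(active_hours_per_day: float, days: int, scale_to_zero: bool) -> float:
    """Compute hours billed over a period, with or without scale-to-zero."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    # Without scale-to-zero the instance is billed around the clock.
    hours_per_day = active_hours_per_day if scale_to_zero else 24.0
    return hours_per_day * days


always_on = billed_compute_hours(10, 30, scale_to_zero=False)
serverless = billed_compute_hours(10, 30, scale_to_zero=True)
print(always_on, serverless, round(serverless / always_on, 2))
```

Under these assumed numbers, the scale-to-zero backend is billed for well under half the compute hours of an always-on instance; the actual ratio depends entirely on how idle your workload is.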

Managed Delta Sync — The ETL Eliminator

Every write to Lakebase is automatically synced to Delta tables in Unity Catalog. For agent systems, this is what closes the long-term memory loop without custom code:

Agent session checkpoints (short-term) → Lakebase → automatic Delta sync → Gold layer for analysis
Lakehouse Monitoring can track agent reasoning drift, latency patterns, and success rate from the Delta-synced inference data
The grid operations team in our Solar Forecasting project needed low-latency reads on Gold forecast data, so we built a data copy pipeline as a workaround. It added latency and a maintenance surface, and it is exactly the kind of custom pipeline that managed sync is meant to make unnecessary.

Unity Catalog as the Single Governance Layer

Lakebase instances are registered in Unity Catalog under the same 3-level namespace as your Delta tables and ML models. The same row-level security policies, column masking, lineage graphs, and access audit logs that govern energy_nz.solar.gold also govern the Lakebase instance storing agent session state. For enterprise AI systems operating under regulatory oversight, this is a structural requirement — not a preference.

The Comparison: Why the Alternatives Fall Short for Databricks-Native Agentic Systems

Supabase is an excellent platform for its target use case. Postgres, auth, storage, real-time subscriptions, and edge functions bundled into a working backend in minutes — at $25/month, it's exceptionally competitive for early-stage web applications. But for enterprise agentic systems on Databricks, there are two structural gaps that don't close with configuration: there is no Unity Catalog (agents operating on governed enterprise data need the same governance layer as the data itself), and there is no Lakehouse sync (analytical data still requires an ETL pipeline to reach Supabase, and Supabase data requires an ETL pipeline to reach the Lakehouse for monitoring and ML). Supabase asks you to build and maintain that bridge. Lakebase eliminates it.

Azure Database for PostgreSQL Flexible Server is a solid choice for traditional Azure-native transactional workloads. But compute and storage are coupled together — creating an isolated development copy of a production database requires duplicating the full storage volume, an operation measured in hours and charged by the gigabyte. There is no native database branching, no Lakehouse sync, and the governance model (Azure RBAC) is entirely separate from Unity Catalog. For teams building on Azure Databricks who want a single governance boundary across OLTP, OLAP, and ML — this means managing two different access control systems with no native bridge between them.

Azure Cosmos DB is purpose-built for globally distributed, multi-region, flexible-schema NoSQL workloads, a genuinely different problem from agentic state management. Its core NoSQL API is not PostgreSQL-compatible, which means LangGraph's Postgres checkpointer doesn't apply, standard psycopg2 drivers don't connect, and the document model doesn't naturally represent the relational shape of session checkpoints and handoff records. Cosmos DB is the right answer for a different question.

What I'd Rebuild in the Agentic Platform

With Lakebase available, the architecture for the three modules changes specifically:

Module 1 — Swarm Coordination:

Coordinator checkpoint store → Lakebase: thread state, current turn context, handoff records, confidence scores per routing decision. LangGraph Postgres checkpointer on Databricks Apps, authentication via Service Principal.
Agent episodic memory → Delta Lake (unchanged): cross-session analytical queries, SHAP analysis across sessions, Lakehouse Monitoring on reasoning patterns. Lakebase managed sync keeps Delta current automatically.

Module 2 — Ontology-Based Reasoning:

Ontology triples → Delta (unchanged): batch reads by the re-ranking gate, SQL queries for sub-graph retrieval. No change needed here — this is an OLAP access pattern.
Grounding cache → Lakebase: frequently accessed ontology sub-graphs cached in Postgres for sub-50ms retrieval during the agent's inner reasoning loop.

Module 3 — Causal Optimization:

Intervention results → Lakebase → managed Delta sync: causal engine writes intervention outcomes (high-frequency, transactional) to Lakebase. Sync pushes results to the Gold Delta layer for downstream analytics without custom ETL.
Causal DAG structure → Delta (unchanged): the DAG (edges, confidence scores, version history) is read by batch retraining jobs after PSI-triggered re-learning. Delta time travel for DAG versioning is already the right pattern here.

The net effect: short-term transactional operations at Postgres latency, long-term analytical operations at Delta scale, a single Unity Catalog governance layer across both, and zero custom ETL pipelines connecting them.
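
One way to keep that placement explicit in code is a small lookup table that agent modules consult before writing. The enum, the state-type keys, and the store_for helper below are hypothetical names for this sketch; the placements themselves are the ones listed for the three modules above.

```python
from enum import Enum


class Store(Enum):
    LAKEBASE = "lakebase"  # transactional, exact-ID reads and writes
    DELTA = "delta"        # analytical, batch reads


# Placement of each state type across the three modules.
PLACEMENT = {
    "coordinator_checkpoints": Store.LAKEBASE,  # Module 1
    "episodic_memory": Store.DELTA,             # Module 1
    "ontology_triples": Store.DELTA,            # Module 2
    "grounding_cache": Store.LAKEBASE,          # Module 2
    "intervention_results": Store.LAKEBASE,     # Module 3, then synced to Delta
    "causal_dag": Store.DELTA,                  # Module 3
}


def store_for(state_type: str) -> Store:
    """Return the store a given state type belongs in, or fail loudly."""
    try:
        return PLACEMENT[state_type]
    except KeyError:
        raise ValueError(f"no placement defined for {state_type!r}") from None


print(store_for("grounding_cache").value)
```

Failing on unknown state types is deliberate: a new kind of agent state should force a placement decision rather than default silently to one store.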

When Lakebase Isn't the Answer

A credible recommendation has boundaries. Lakebase is not the right choice when:

Your OLTP workload is genuinely independent of analytics — a standalone web app with no ML components or Lakehouse integration doesn't benefit from the co-location.
You need niche Postgres extensions not yet supported in Lakebase's managed environment (specialized GIS, custom time-series extensions).
You're building a consumer-facing mobile application where Supabase's bundled auth, storage, and real-time subscriptions are the actual product value.
You're not on Databricks. The Lakehouse integration is the primary differentiation — without it, Lakebase is a well-engineered managed Postgres, but not a category-defining choice.

The decision criterion is simple: how close is your agent workload to your Databricks analytics and ML stack? The closer it is, the more Lakebase earns its place.

The Larger Picture

Databricks started as the platform where you process and model data. Unity Catalog made it the platform where you govern data. Lakebase makes it the platform where you run transactional applications on that data: without copying it, without bridging governance models, without maintaining a second operational stack alongside your analytics stack.

The 4x database creation stat isn't a curiosity. It's a forcing function. When agents provision databases at that rate, every architectural inefficiency — the manual provisioning, the ETL pipeline, the separate governance model — compounds at agent speed. Human architects designed those inefficiencies in; agents will expose them.

After rebuilding the Agentic Platform architecture mentally with Lakebase in place, the change is not additive — it's structural. It's the difference between three systems (OLTP, OLAP, ML) connected by pipelines you maintain, and one platform where those boundaries exist only in your mental model.

If this resonated, I'd welcome your thoughts in the comments — especially if you've hit the OLTP/OLAP boundary problem in your own agentic architectures. What did your workaround look like?