The Databricks Data Intelligence Platform unifies data, AI, and governance so organizations can put all of their data to work. Until recently, though, operational workloads still lived outside the platform — requiring separate databases, duplicated data, and manual ETL pipelines to power applications.
Databricks Lakebase changes that. It's a fully managed PostgreSQL database that extends the Lakehouse to low-latency, transactional use cases. By bringing OLTP and OLAP together on one platform, Lakebase unlocks new real-time application patterns.
In short: your data isn’t just for analytics anymore — it can now power your apps, APIs, and ML models in real time.
The diagram above illustrates how Databricks Lakebase extends the Lakehouse from analytics into operational data serving.
On the left, we have the familiar Medallion Architecture:
Bronze → Silver → Gold tables represent the curation path of raw data into high-quality, governed datasets inside Unity Catalog. These tables power analytical workloads like BI dashboards, machine learning models, and AI agents.
On the right, Lakebase is a fully managed Postgres environment. Databricks provides a managed sync pipeline that automatically replicates selected tables into Lakebase as “Synced Tables.” This enables applications to perform low-latency reads—on the order of tens of milliseconds—without the need for custom ETL pipelines or separate databases.
Tip: In Databricks, the terms database and schema are used interchangeably inside a catalog. In Postgres (and therefore in Lakebase), a database is a higher-level container, and each database contains multiple schemas. So while a Postgres database isn't exactly the same as a Databricks catalog, that's the closest conceptual equivalent. It sits at the top level within a Lakebase instance.
Synced Tables are read-only replicas of your Lakehouse tables, refreshed through Databricks-managed sync pipelines.
You can also create Postgres-native tables directly in Lakebase. These handle inserts, updates, and deletes—ideal for managing application state, session data, or event logs alongside analytical data.
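For instance, here's a minimal sketch of a native table for session state (the table and column names are just illustrative):

```sql
-- Illustrative only: a Postgres-native table for application state,
-- living alongside your read-only synced tables.
CREATE TABLE app_sessions (
    session_id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id      bigint NOT NULL,
    last_seen_at timestamptz NOT NULL DEFAULT now(),
    state        jsonb
);

-- Standard transactional writes work as expected.
INSERT INTO app_sessions (user_id, state)
VALUES (42, '{"theme": "dark"}');

UPDATE app_sessions SET last_seen_at = now() WHERE user_id = 42;
```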
Finally, Lakebase can also be federated into Unity Catalog for metadata visibility, lightweight querying, and permission management. While this doesn't replicate the underlying data, it provides a simple way to explore and query what's stored in Lakebase without leaving Databricks.
Now that we've seen how Lakebase fits into the broader Data Intelligence Platform, let's walk through a few best practices, tips, and code snippets that will save you time when you sit down to try this out. Here's what I've learned in the field working on Lakebase with customers:
One of the first things you’ll run into with Lakebase is authentication. Lakebase uses short-lived OAuth tokens for database access — typically valid for only one hour. That’s great for security, but not so great when you’re trying to debug queries in pgAdmin.
Use the Built-In Lakebase Query Editor for Development:
For quick tests, use the PostgreSQL query editor built right into Databricks. Databricks authenticates you automatically, so you don't need to fetch a token, and it's the easiest way to validate your synced tables, run GRANTs, or inspect schemas. The PostgreSQL editor is a bit tucked away in the UI, so it can take a moment to find.
Use Databricks OAuth tokens, not Entra ID tokens
If you're on Azure, you may be tempted to generate Entra ID (Azure AD) tokens for your service principal. At the time of this writing, these will not work; they have to be Databricks OAuth tokens.
Our official documentation has great examples of how to refresh tokens for applications programmatically
If you're connecting from a Databricks App, check out the Databricks Apps Cookbook: connect to Lakebase
Use native Postgres logins when OAuth is not an option
OAuth should be your first choice for authenticating to Lakebase. However, if you have workloads that can't rotate tokens, you can configure your Lakebase instance to support native Postgres roles. This allows you to create a persistent password that can be used to authenticate to Lakebase.
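Once that's enabled, creating the role is plain Postgres. A minimal sketch (role name and password are placeholders):

```sql
-- Illustrative: a password-based role for a workload that can't rotate OAuth tokens.
-- Assumes the instance has been configured to allow native Postgres logins.
CREATE ROLE legacy_etl_user WITH LOGIN PASSWORD 'use-a-strong-generated-secret';

-- The role still needs explicit grants, covered in the next section.
```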
If you want your application, or a group of Databricks users, to query data in Lakebase, you first need to authorize them inside the Postgres database itself. While Lakebase integrates with Unity Catalog for Databricks identities, permissions are still enforced locally within Postgres when querying through Postgres interfaces, which is what you'll be using to get low-latency reads.
This is one of the most common sources of confusion for new users: an identity that can see a synced table in Unity Catalog may still be denied when querying it through a Postgres interface. That's because Postgres permissions live inside the Lakebase instance itself, not in Unity Catalog. So, even though identities are federated from Databricks, authorization must be explicitly set in Lakebase.
There are three steps: create a Postgres role for the Databricks identity, grant it access to the schema, and grant it privileges on the tables it needs. You can run these commands directly in the Lakebase Query Editor (or through a driver).
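Here's a minimal sketch of those steps, assuming a service principal named my-app-sp and synced tables in the public schema; the role-creation helper shown is illustrative, so check the Lakebase docs for the exact call for your identity type:

```sql
-- 1) Create a Postgres role mapped to the Databricks identity
--    (illustrative helper; verify the exact extension/function in the docs).
CREATE EXTENSION IF NOT EXISTS databricks_auth;
SELECT databricks_create_role('my-app-sp', 'SERVICE_PRINCIPAL');

-- 2) Let that role see the schema holding your synced tables.
GRANT USAGE ON SCHEMA public TO "my-app-sp";

-- 3) Grant only the table privileges the app needs.
GRANT SELECT ON ALL TABLES IN SCHEMA public TO "my-app-sp";
```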
If you’re bringing data into Lakebase, it’s almost always because you need low-latency lookups. In relational systems like Postgres, indexes are how you get sub-10ms reads—but only when they match your data access pattern.
When a Delta table is synced into Lakebase, the primary key you define for the synced table is indexed automatically, but other columns are not.
Let's see how this works with an example. I've created a synced table, users_synced, with 1M rows, where user_id is the primary key. Using EXPLAIN ANALYZE, we can see the query plan and query execution time.
Query Using the Primary Key (Indexed Automatically)
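Here's roughly what that looks like (the lookup value is arbitrary):

```sql
EXPLAIN ANALYZE
SELECT *
FROM users_synced
WHERE user_id = 12345;
-- Expect an Index Scan node on the primary key index in the plan,
-- with a very fast execution time.
```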
From the query plan, we can see that Lakebase is performing an index scan and we’re getting lightning fast responses.
Query Using a Non-Indexed Column
Now, let's try querying on a non-indexed field (email):
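Something like this (the email value is arbitrary):

```sql
EXPLAIN ANALYZE
SELECT *
FROM users_synced
WHERE email = 'jane.doe@example.com';
-- With no index on email, the plan falls back to a sequential scan
-- over the whole table (about 83 ms in this example).
```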
Because there was no index on email, Lakebase had to scan the entire table. 83 ms might be acceptable for some apps, but the latency grows quickly as data scales.
Query Using a Manually Created Index (email)
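A sketch of the index creation and the repeated query (the index name is illustrative):

```sql
CREATE INDEX idx_users_synced_email ON users_synced (email);

EXPLAIN ANALYZE
SELECT *
FROM users_synced
WHERE email = 'jane.doe@example.com';
-- The plan switches to an index scan, and execution time drops
-- from ~83 ms to ~0.082 ms in this example.
```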
After adding the index, the plan shows an index scan and our query execution time drops from 83 ms to just 0.082 ms.
You can also create composite indexes for queries with multiple filters.
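For instance, if your app always filters on a region and a status column (both hypothetical here):

```sql
-- Column order matters: lead with the column your queries always filter on.
CREATE INDEX idx_users_synced_region_status
    ON users_synced (region, status);

-- Served by the composite index: filters on region, or on region AND status.
SELECT *
FROM users_synced
WHERE region = 'EMEA'
  AND status = 'active';
```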
Indexes accelerate reads, but they’re not free. Every index in Lakebase is a physical structure stored alongside your data. They also add overhead to your writes and syncs. So, use indexes to boost performance, but only index what you query.
If you’re exploring Lakebase, your journey probably doesn’t stop at the Databricks UI. Most teams want a custom application or API that gives end users a fast, interactive way to work with governed data.
This is where Lakebase comes to life: it brings the reliability and governance of the Lakehouse to operational experiences, enabling sub-second reads and transactional writes directly from your applications.
Once you’ve set up OAuth authentication and Lakebase-side permissions (as covered above), your app connects to Lakebase just like any Postgres database — through a standard driver or ORM such as SQLAlchemy or psycopg. Make sure to reference the documentation for examples on rotating tokens programmatically and connection pooling.
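As a rough sketch of the rotation pattern with SQLAlchemy, where get_lakebase_token is a placeholder for however you mint a fresh OAuth token (for example via the Databricks SDK, as shown in the docs), and the user, host, and database names are illustrative:

```python
import os
from sqlalchemy import create_engine, event, text

def get_lakebase_token() -> str:
    # Placeholder: return a fresh, short-lived Databricks OAuth token.
    # In practice, mint this with the Databricks SDK or REST API per the docs.
    return os.environ["LAKEBASE_OAUTH_TOKEN"]

# Illustrative connection string: user, host, and database will differ for your instance.
engine = create_engine(
    "postgresql+psycopg2://my-service-principal@my-instance.database.cloud.databricks.com:5432/"
    "databricks_postgres?sslmode=require",
    pool_pre_ping=True,  # transparently replace connections the server has closed
)

# Inject a fresh token as the password whenever the pool opens a new connection,
# so hour-old, expired tokens are never reused.
@event.listens_for(engine, "do_connect")
def provide_token(dialect, conn_rec, cargs, cparams):
    cparams["password"] = get_lakebase_token()

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```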
If your application is hosted within Databricks Apps, there's an even simpler path to integrate with Lakebase. Databricks Apps are tightly integrated into the Databricks Data Intelligence Platform, and Lakebase fits naturally into that ecosystem, which lets you fast-forward much of the setup required to connect.
When you declare a Lakebase instance as a resource for your Databricks App, Databricks automatically provisions a Postgres role for the app's service principal and makes the connection details available to the app. That means you don't have to manage role creation or connection information; it's handled automatically when the app deploys.
OAuth token rotation is still your responsibility: your application needs to periodically request and refresh new tokens. The Databricks Apps Cookbook includes examples of how to do this.
When deploying through a Databricks Asset Bundle (DAB), you can promote your app and Lakebase resource definitions together across environments (dev -> stage -> prod). Here’s an example of what the apps section in your DAB YAML file would look like:
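Here's a rough sketch of what that might look like; the resource keys and permission value are illustrative, so verify them against the current Databricks Asset Bundles and Apps documentation:

```yaml
# databricks.yml (excerpt) -- illustrative field names; check the DAB schema reference
variables:
  lakebase_instance:
    description: Name of the Lakebase instance backing the app
    default: lakebase-dev

resources:
  apps:
    my_lakebase_app:
      name: my-lakebase-app
      source_code_path: ./app
      resources:
        - name: lakebase
          database:
            instance_name: ${var.lakebase_instance}
            database_name: databricks_postgres
            permission: CAN_CONNECT_AND_CREATE   # illustrative permission value

targets:
  dev:
    default: true
  prod:
    variables:
      lakebase_instance: lakebase-prod
```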
This pattern ensures consistency across deployments - the same app, the same Lakebase configuration, just parameterized for each environment.
As with jobs, clusters, and SQL warehouses, Lakebase costs show up in Databricks system tables; the difference is that Lakebase combines multiple components:
| Cost Category | SKU | How to Attribute It |
| --- | --- | --- |
| Lakebase Compute | DATABASE_SERVERLESS_COMPUTE | Tag the Lakebase instance |
| Lakebase Storage | DATABRICKS_STORAGE | Tag the Lakebase instance |
| Sync Pipeline Compute | JOBS_SERVERLESS_COMPUTE | Tag the Synced Table |
Without tagging, Lakebase costs show up as generic database spend. Tagging lets you attribute that spend to specific instances, synced tables, and the teams that own them.
You can break down Lakebase usage through the Unity Catalog system.billing.usage table:
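For example (the cost_center tag key is just an example of a tag you might apply to your instances and synced tables):

```sql
-- Roll up Lakebase-related spend by day, SKU, and a custom tag.
SELECT
  usage_date,
  sku_name,
  custom_tags['cost_center'] AS cost_center,
  SUM(usage_quantity)        AS total_usage
FROM system.billing.usage
WHERE sku_name LIKE '%DATABASE_SERVERLESS_COMPUTE%'
   OR sku_name LIKE '%DATABRICKS_STORAGE%'
   OR sku_name LIKE '%JOBS_SERVERLESS_COMPUTE%'
-- Note: JOBS_SERVERLESS_COMPUTE also covers other serverless jobs;
-- use tags on your synced tables to isolate sync-pipeline spend.
GROUP BY usage_date, sku_name, custom_tags['cost_center']
ORDER BY usage_date DESC, total_usage DESC;
```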
Lakebase isn’t just “Postgres on Databricks.” It’s a shift in how we think about operational and analytical data living together — governed, queryable, and powered by the same platform.
Instead of standing up a separate database, building reverse ETL jobs, and managing auth in two places, you can now sync governed tables into managed Postgres, serve them to your applications with millisecond reads, and write application state back, all on one platform.
Success with Lakebase isn’t about just turning it on — it’s about designing with access patterns in mind, getting auth right, and indexing only what your apps truly query.