
Wait, Did Databricks Just Put Git Inside My Database?

AbhaySingh
Databricks Employee


If you've been scratching your head at Lakebase's "branching" feature wondering "am I working with a database or GitHub?"—you're not alone. Let me break down what's actually happening here, because once it clicks, it changes how you think about database development entirely.

Prerequisites

Before we dive in, make sure you've got:

  • A Databricks workspace on AWS (Lakebase Autoscaling is AWS-only for now)
  • Your workspace in a supported region: us-east-1, us-east-2, eu-central-1, eu-west-1, eu-west-2, ap-south-1, ap-southeast-1, or ap-southeast-2
  • Lakebase Autoscaling Preview enabled by your workspace admin
  • A Lakebase project created (takes about 30 seconds)

So What Actually IS Lakebase Branching?

Here's the thing that confused me initially: Lakebase isn't Delta Lake. It's a fully managed PostgreSQL database. So when we talk about "branching," we're not talking about Delta's transaction log or time travel—we're talking about something fundamentally different.

When you create a Lakebase project, you automatically get two branches: production (your root/default branch) and development (a child of production). From there, you can create child branches from any existing branch, building out a hierarchy that looks eerily familiar to anyone who's used Git:

production (root - protected)
├── staging
│   └── feature-payments
└── development
    ├── dev-alice
    └── dev-bob

Each branch is a fully independent PostgreSQL database environment. Changes you make in dev-alice don't touch dev-bob, and neither affects production. It's complete isolation—but without the hours of data copying you'd normally need.

The Copy-on-Write Magic (Why It's Instant)

Okay, so how does a 200GB database branch in 3 seconds? The answer is copy-on-write storage, and it's actually pretty elegant.

When you create a branch, Lakebase doesn't duplicate your data. Instead, the new branch just gets pointers to the same underlying storage as its parent. Think of it like Git's branching—the branch itself is essentially free because it's just referencing the same data.

BEFORE ANY CHANGES:
                                     
production branch      dev branch (just created)
┌──────────────┐      ┌──────────────┐
│ users table  │◄─────│ → pointer    │
│ orders table │◄─────│ → pointer    │  
│ products     │◄─────│ → pointer    │
└──────────────┘      └──────────────┘
     (actual data)        (no storage cost yet)

AFTER MODIFYING users TABLE IN dev:

production branch      dev branch
┌──────────────┐      ┌──────────────┐
│ users table  │      │ users table' │ ← only this is new storage
│ orders table │◄─────│ → pointer    │
│ products     │◄─────│ → pointer    │
└──────────────┘      └──────────────┘

The implications are huge:

  1. Branch creation is instant regardless of database size—10GB or 10TB, doesn't matter
  2. You only pay for what changes—modify 1GB in a 100GB database, you pay for ~1GB extra storage
  3. Zero performance hit on production during branching
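To make the copy-on-write idea concrete, here's a toy model in plain Python (no Lakebase API involved): a branch holds only pointers into its parent's pages and allocates new storage just for the pages it writes.

```python
class CowBranch:
    """Toy copy-on-write branch: reads fall through to the parent,
    writes allocate new storage only for the modified page."""

    def __init__(self, parent_pages: dict):
        self._parent = parent_pages  # shared with the parent, never mutated here
        self._own = {}               # pages this branch has diverged on

    def read(self, page):
        # A branch-local copy wins; otherwise fall through to the parent.
        return self._own.get(page, self._parent.get(page))

    def write(self, page, data):
        self._own[page] = data       # storage cost appears only at this moment

    @property
    def extra_pages(self):
        return len(self._own)        # "you only pay for what changes"


prod = {"users": "v1", "orders": "v1", "products": "v1"}
dev = CowBranch(prod)

assert dev.extra_pages == 0          # branch creation is free
dev.write("users", "v2")
assert dev.read("users") == "v2"     # dev sees its own change
assert prod["users"] == "v1"         # production is untouched
assert dev.extra_pages == 1          # only the modified page is new storage
```

This is exactly why creation time doesn't depend on database size: nothing is copied until a write happens.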

Creating Your First Branch

You can create branches through the Lakebase App UI, Python SDK, Java SDK, CLI, or REST API. Here's the UI workflow:

  1. Open the Lakebase App from the apps switcher
  2. Navigate to your project's Branches page
  3. Click Create branch
  4. Give it a name (I like the pattern dev/yourname for personal branches)
  5. Choose your source—current data or a point in time
  6. Click Create

That's it. A few seconds later, you've got a complete branch with its own compute endpoint and connection string.

If you prefer code, here's the Python SDK approach:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.postgres import Branch, BranchSpec, Duration

w = WorkspaceClient()

# Create a branch with 7-day expiration
branch_spec = BranchSpec(
    ttl=Duration(seconds=604800),  # 7 days in seconds
    source_branch="projects/my-project/branches/production"
)

branch = Branch(spec=branch_spec)

result = w.postgres.create_branch(
    parent="projects/my-project",
    branch=branch,
    branch_id="dev-alice"
).wait()

print(f"Branch created: {result.name}")

Or via CLI:

databricks postgres create-branch projects/my-project dev-alice \
  --json '{
    "spec": {
      "source_branch": "projects/my-project/branches/development",
      "ttl": "604800s"
    }
  }'

Pro tip: You can set branches to auto-expire. UI presets are 1 hour, 1 day, or 7 days. Via API, you can set any TTL up to 30 days max—perfect for CI/CD test branches that should clean themselves up.
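If you're scripting branch creation, a tiny helper keeps TTLs honest. This formats the `"<seconds>s"` string used in the CLI example above and enforces the 30-day API cap (the helper itself is mine, not part of the SDK):

```python
def branch_ttl(days: int, max_days: int = 30) -> str:
    """Format a branch TTL as the '<seconds>s' string the CLI example uses.
    The API caps branch expiration at 30 days."""
    if not 0 < days <= max_days:
        raise ValueError(f"TTL must be between 1 and {max_days} days")
    return f"{days * 86400}s"


print(branch_ttl(7))  # 604800s, matching the 7-day example above
```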

Once You're IN the Branch, It's Just Postgres

Here's where it feels normal again. Once you connect to your branch (each branch has its own connection string), you're just running standard PostgreSQL:

-- Create a new table for your feature
CREATE TABLE user_preferences (
    id SERIAL PRIMARY KEY,
    user_id INT REFERENCES users(id),
    theme VARCHAR(50) DEFAULT 'light',
    notifications BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Add a column to an existing table
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;

-- Create an index
CREATE INDEX idx_users_last_login ON users(last_login);

-- Insert test data
INSERT INTO user_preferences (user_id, theme) 
VALUES (1, 'dark'), (2, 'light'), (3, 'dark');

No special syntax. No branch-aware commands. Just write your migrations like you always have.
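Connecting is equally unremarkable: it's a standard PostgreSQL connection string. Here's a sketch of building one for a branch endpoint; the host, user, and database names below are placeholders, not Lakebase defaults, so substitute the values from your branch's connection details.

```python
from urllib.parse import quote

def branch_dsn(host: str, user: str, database: str = "postgres",
               port: int = 5432) -> str:
    """Build a standard libpq-style DSN for a branch endpoint.
    sslmode=require is a sensible default for a managed endpoint.
    All default values here are placeholders, not Lakebase defaults."""
    return (f"postgresql://{quote(user, safe='')}@{host}:{port}"
            f"/{database}?sslmode=require")


# Hypothetical branch endpoint host:
print(branch_dsn("dev-alice.branch.example.net", "alice"))
```

Any Postgres client (psql, psycopg, JDBC) should accept the resulting DSN as-is.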

The Elephant in the Room: There's No Merge

Alright, here's the part that'll trip you up if you're expecting full Git semantics: Lakebase doesn't have native merge functionality. You can't just click a button to promote changes from dev-alice back to development.

The docs are clear: "To move changes from child to parent, use your standard migration tools."

So what's the actual workflow? It looks like this:

  1. Develop and test your schema changes on your feature branch
  2. Use Schema Diff (more on this in a sec) to review exactly what changed
  3. Apply those same migrations to the parent branch using your migration framework—Alembic, Prisma, Django migrations, Flyway, whatever you use
  4. Reset or delete your feature branch

It's not as seamless as Git merge, but honestly? It forces good migration hygiene. You can't get lazy and just "merge and hope"—you have to actually track your schema changes properly.

Schema Diff: Your Pre-Merge Safety Net

This feature partially makes up for the lack of merge. Schema Diff lets you compare the DDL between any two branches, showing exactly what's different.

To use it:

  1. Navigate to your child branch in the Lakebase App
  2. Click Schema diff in the Parent branch section
  3. Select your base branch (defaults to parent) and database
  4. Click Compare

You'll get a side-by-side view: red for removed/changed from base, green for added/changed in your branch. It captures tables, columns, constraints, indexes—all your schema objects.

I've started using this before EVERY migration promotion. It's caught a few "oops, I didn't mean to add that column" moments.
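If you want a similar check outside the UI, a rough stand-in is diffing two schema-only dumps (for example, from `pg_dump --schema-only` run against each branch endpoint). A minimal sketch using the standard library:

```python
import difflib

def ddl_diff(base_ddl: str, branch_ddl: str) -> str:
    """Unified diff of two schema-only dumps: '-' lines exist only in the
    base branch, '+' lines only in the child branch."""
    return "\n".join(difflib.unified_diff(
        base_ddl.splitlines(), branch_ddl.splitlines(),
        fromfile="development", tofile="dev-alice", lineterm=""))


base = "CREATE TABLE users (id INT);"
branch = ("CREATE TABLE users (id INT);\n"
          "CREATE TABLE user_preferences (id INT);")
print(ddl_diff(base, branch))
```

It won't give you the UI's side-by-side rendering, but it's handy in CI where a nonempty diff can fail the build.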

Branch Reset: The One-Way Refresh

Branch reset instantly updates a child branch to match its parent's current state. Key word: one-way. Parent to child only.

When would you use this?

  • Starting fresh on a new feature after finishing the previous one
  • Your dev branch has drifted too far from production
  • You want the latest production data without creating a new branch

A few gotchas here:

  • Reset is a complete overwrite—any local changes are gone
  • Existing connections are temporarily interrupted (but reconnect automatically with the same connection details)
  • Root branches (like production) can't be reset since they have no parent
  • Can't reset branches that have children—delete the children first
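Those rules are easy to encode as a pre-flight check before you script a reset. A sketch, with the caveat that the branch dict shape below is illustrative and not the real API response:

```python
def can_reset(branch: dict) -> tuple[bool, str]:
    """Pre-flight check mirroring the documented reset rules.
    The dict shape here is made up for illustration."""
    if branch.get("parent") is None:
        return False, "root branches have no parent to reset from"
    if branch.get("children"):
        return False, "delete child branches before resetting"
    return True, "ok to reset (local changes will be overwritten)"


ok, reason = can_reset({"name": "dev-alice",
                        "parent": "development",
                        "children": []})
print(ok, reason)
```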

Point-in-Time Branching: Your "Oh No" Recovery Button

This is one of my favorite features. You can create a branch from any point within your restore window (configurable from 2 to 35 days).

Real scenario: Someone ran a DELETE FROM orders WHERE status = 'pending' without a WHERE order_date < ... clause. Poof—three days of orders gone.

With point-in-time branching:

  1. Create a new branch set to 10 minutes before the disaster
  2. Connect to that branch
  3. Query and extract the missing records
  4. Insert them back into production

No calling support. No restoring from backups. Just branch, query, fix.
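One practical detail when you do this under pressure: compute the restore timestamp in UTC, with a cushion before the incident, rather than eyeballing it. A small helper (the exact request field for a point-in-time source isn't shown in this post, so treat this as input preparation only):

```python
from datetime import datetime, timedelta, timezone

def recovery_point(incident_at: datetime, cushion_minutes: int = 10) -> str:
    """RFC 3339 UTC timestamp a safe cushion before the incident,
    suitable as the 'point in time' for branch creation."""
    if incident_at.tzinfo is None:
        # Treat naive timestamps as UTC rather than guessing a local zone.
        incident_at = incident_at.replace(tzinfo=timezone.utc)
    point = incident_at.astimezone(timezone.utc) - timedelta(minutes=cushion_minutes)
    return point.isoformat()


print(recovery_point(datetime(2025, 6, 1, 14, 30, tzinfo=timezone.utc)))
# 2025-06-01T14:20:00+00:00
```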

My Development Workflow (What Actually Works)

After a few weeks of using this, here's the pattern that's clicked for my team:

production (protected)
└── development
    ├── dev/alice
    ├── dev/bob
    └── dev/charlie

Each dev gets their own long-lived branch off development. We reset them weekly to stay reasonably current with shared changes.

The workflow:

  1. Reset your branch from development to start fresh
  2. Develop your feature—write migrations, test with real-ish data
  3. Schema Diff against development to review changes
  4. Apply migrations to development using our standard Alembic workflow
  5. Test on development with the team
  6. Repeat the migration promotion to production
  7. Reset your personal branch and start the next feature

For CI/CD, we spin up ephemeral branches with short expiration:

import uuid
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.postgres import Branch, BranchSpec, Duration

w = WorkspaceClient()
branch_id = f"ci-{uuid.uuid4().hex[:8]}"

# Create 2-hour ephemeral branch
branch_spec = BranchSpec(
    ttl=Duration(seconds=7200),  # 2 hours
    source_branch="projects/my-project/branches/development"
)
branch = Branch(spec=branch_spec)

result = w.postgres.create_branch(
    parent="projects/my-project",
    branch=branch,
    branch_id=branch_id
).wait()

# Run integration tests against branch endpoint
# Branch auto-deletes after 2 hours

No more fighting over shared test databases. No more "who left test data in staging?"

How This Differs from Delta Lake Time Travel

I keep getting asked this, so let's clear it up:

                  Lakebase Branching                        Delta Lake Time Travel
What it is        PostgreSQL with copy-on-write             Delta table version history
Workload          OLTP (transactional)                      OLAP (analytical)
Scope             Entire database                           Individual tables
Can you write?    Yes, full read/write                      No, historical versions are read-only
Syntax            SDK/CLI/API for branches, SQL inside      VERSION AS OF, TIMESTAMP AS OF

Delta time travel lets you read historical states. Lakebase branching gives you a writable, isolated copy that starts from a point in time. Very different use cases.

Limits You'll Actually Hit

Before you go branch-crazy, here are the limits that matter:

Resource                               Limit
Branches per project                   500
Unarchived branches                    10 (this one bites people)
Concurrent active computes             20 (default branch exempt)
Databases per branch                   500
Roles per branch                       500
Max data size per branch               8 TB
Protected branches                     1 per project
Root branches                          3 per project
History retention (restore window)     2-35 days
Branch expiration max                  30 days

The unarchived branches limit of 10 is the one that catches teams off guard. If you're spinning up lots of dev branches, inactive ones get archived automatically. Protected branches and default branches are exempt from archival.

The Unity Catalog Connection

You can register Lakebase databases in Unity Catalog for unified governance and cross-source queries. But here's the important part: Unity Catalog catalogs are read-only mirrors.

You can query your Lakebase data from Databricks SQL alongside your Lakehouse tables:

-- Join OLTP data with analytics
SELECT 
    o.order_id,
    o.customer_id,
    c.lifetime_value,
    c.churn_risk_score
FROM lakebase_catalog.public.orders o
JOIN main.analytics.customer_360 c 
    ON o.customer_id = c.customer_id
WHERE o.order_date > CURRENT_DATE - INTERVAL '7 days';

But if you want to write to Lakebase, you need to connect directly to your branch endpoint. Also note: each branch requires separate catalog registration, and metadata syncs have caching—new objects may need a manual refresh to appear.

Final Thoughts

Lakebase branching isn't trying to replace Git for your code—it's bringing the same mental model to your data layer. The ability to instantly create isolated, writable copies of your database changes the development calculus entirely.

No more waiting for database clones. No more "hope this migration doesn't break prod." No more shared test environments where everyone's stepping on each other.

The lack of native merge is a real limitation, but it's one that forces you to treat migrations as first-class citizens anyway—which is probably what you should've been doing all along.

Give it a shot on your next feature. Spin up a personal branch, break something on purpose, reset, and try again. Once you experience that workflow, going back to "let me clone the database real quick" feels like the stone age.


Got questions about Lakebase branching? Drop them in the comments. And if you've figured out a clever workflow I haven't thought of, I'm all ears.

1 REPLY

Louis_Frolio
Databricks Employee

@AbhaySingh , 

This was a fun read — and a great way to spark discussion about what “Git inside my database” really means in practice.

From what I’m seeing in the product world, Databricks isn’t literally putting Git inside the storage engine of your tables — it’s bringing Git workflows directly into the workspace UX so your notebooks, SQL queries, dashboards and other artifacts live in Git folders/Repos and you can branch, commit, push and pull without context-switching out of Databricks. 

That shift has a ton of practical value for teams that want classic software engineering best practices — feature branches, CI/CD, collaboration — but it’s also worth grounding expectations a bit: the Git integration is fundamentally a workspace-level source control layer, not a metadata/time-travel layer over the data in your tables. In other words, you’re not versioning your Delta lake like a Git object store here — you’re versioning the code and queries you write against it.

Curious to hear how folks are using this in real projects — especially around branching strategies and managing merge workflows across notebooks and SQL.

Cheers, Louis! 🚀