In this blog, I will walk through an architecture for serving API consumption on the Databricks Platform using a Lakebase-based approach, and show why it fits this kind of API requirement.
API Requirements: performance, scalability, availability, security, and monitoring & observability.
This assumes a data processing pipeline is already in place to process the data, either in near real time (NRT) or through batch processing.
This high-level flow assumes the data platform has already completed all processing, whether NRT or batch. The staging table could be a gold table, a materialized view, or a live table. A dedicated API table is created so that the API never hits any of the base tables, and it can reside in a separate subnet if required by security policy. It can be populated directly from the Silver layer, with access restricted to the API-specific user on this table. This final API table will hold millions of records; a minimal population sketch follows.
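As an illustration, the API table could be refreshed with a MERGE from the Silver layer; the source table name (silver.crm.customer_summary) and the incremental filter below are assumptions, not part of the original design:
-- Illustrative refresh of the API table from an assumed Silver source
MERGE INTO gold.api.customer_summary AS tgt
USING (
  SELECT id, customer_id, cen_id, churn_score, business_units, updated_at
  FROM silver.crm.customer_summary                     -- hypothetical Silver source table
  WHERE updated_at >= current_timestamp() - INTERVAL 1 DAY
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;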
Process Flow:
Final Delta API table (approx. millions of records)
↓
Lakehouse Federation + Lakebase (Powered by Delta Caching + Photon + Materialized Views)
↓
APIM (OAuth2 + Managed Identity) → APIs (served via Lakebase endpoint)
The final (Gold) table should live in Unity Catalog (UC):
USE CATALOG gold;
USE SCHEMA api;
-- Create or Upgrade to a Delta table optimized for point lookups
CREATE TABLE IF NOT EXISTS customer_summary
(
id STRING NOT NULL, -- primary lookup key for API
customer_id STRING,
cen_id STRING,
churn_score DOUBLE,
business_units ARRAY<STRING>,
updated_at TIMESTAMP
)
USING DELTA
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/gold/api/customer_summary';
-- Enable auto-optimization & liquid clustering
ALTER TABLE customer_summary SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.feature.liquidClustering' = 'supported',
'delta.liquidClustering.autoMode.enabled' = 'true'
);
-- Optional: Bloom filter index for high-selectivity key lookups
CREATE BLOOMFILTER INDEX
ON TABLE customer_summary
FOR COLUMNS(id) OPTIONS (fpp = 0.05, numItems = 30000000);
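After loads, periodic maintenance keeps point lookups fast; a minimal sketch, assuming it runs from a scheduled Databricks job:
-- Compact small files and (re)cluster the table, then refresh column statistics
OPTIMIZE gold.api.customer_summary;
ANALYZE TABLE gold.api.customer_summary COMPUTE STATISTICS FOR ALL COLUMNS;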
APIM continues to validate the OAuth2 token from consumers (as in the current setup).
Add retry and circuit-breaker handling at APIM:
<policies>
    <inbound>
        <base />
        <validate-jwt header-name="Authorization" failed-validation-httpcode="401" require-scheme="Bearer">
            <openid-config url="https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration" />
            <audiences>
                <audience>api://your-api-app-id</audience>
            </audiences>
        </validate-jwt>
    </inbound>
    <backend>
        <!-- Retry transient backend failures; circuit breaking is configured on the
             APIM backend resource (circuitBreaker rules), not as an inline policy -->
        <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="2" interval="1" first-fast-retry="true">
            <forward-request />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
Use Managed Identity to get an AAD token and connect with the databricks-sql-connector (or ODBC) over Private Link to the SQL Warehouse:
# requirements.txt
# azure-identity==1.15.0
# databricks-sql-connector==3.1.0
# tenacity==8.2.3
import os

from azure.identity import ManagedIdentityCredential
from databricks import sql
from tenacity import retry, stop_after_attempt, wait_exponential

DATABRICKS_SQL_ENDPOINT = os.environ["DBSQL_ENDPOINT"]  # e.g. adb-<id>.<region>.azuredatabricks.net
WAREHOUSE_ID = os.environ["DBSQL_WAREHOUSE_ID"]
CATALOG = "gold"
SCHEMA = "api"

# Get an AAD access token for Databricks via the Managed Identity
def aad_token():
    cred = ManagedIdentityCredential()
    # Scope built from the Azure Databricks AAD resource ID
    token = cred.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default")
    return token.token

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.1, min=0.2, max=1))
def get_customer(customer_id: str):
    with sql.connect(
        server_hostname=DATABRICKS_SQL_ENDPOINT,
        http_path=f"/sql/1.0/warehouses/{WAREHOUSE_ID}",
        access_token=aad_token(),  # AAD token from the managed identity
        _user_agent_entry="api-lakebase",
    ) as conn:
        with conn.cursor() as cur:
            # The connector runs one statement per call, so set catalog and schema separately
            cur.execute(f"USE CATALOG {CATALOG}")
            cur.execute(f"USE SCHEMA {SCHEMA}")
            cur.execute(
                "SELECT id, customer_id, cen_id, churn_score, business_units, updated_at "
                "FROM customer_summary WHERE id = :id LIMIT 1",
                {"id": customer_id},
            )
            row = cur.fetchone()
            return None if row is None else dict(zip([c[0] for c in cur.description], row))
Key tuning levers: data layout, the SQL warehouse, caching, query design, and the API level.
Monitoring spans both the Azure side and the Databricks side.
KQL sketch (APIM)
AzureDiagnostics
| where ResourceType == "APIM"
| summarize p50=percentile(DurationMs,50), p95=percentile(DurationMs,95), errors=countif(ResponseCode >= 500) by bin(TimeGenerated, 5m)
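On the Databricks side, a comparable latency view can be sketched against the query history system table; this assumes system tables are enabled, and the column names may differ by release:
-- Illustrative: p50/p95 latency and failure count for API lookups, hourly
-- (schema of system.query.history assumed; verify column names in your workspace)
SELECT
  date_trunc('HOUR', start_time)          AS window_start,
  percentile(total_duration_ms, 0.50)     AS p50_ms,
  percentile(total_duration_ms, 0.95)     AS p95_ms,
  count_if(execution_status = 'FAILED')   AS errors
FROM system.query.history
WHERE statement_text LIKE '%customer_summary%'
GROUP BY 1;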
Optimization Summary:
Optimization | Read Throughput (approx.) |
Photon + Delta Cache | 10,000+ QPS (primary-key lookups) |
Liquid Clustering enabled | Consistent <100 ms responses |
Materialized View + cache (only if the API needs aggregations or joins; see the sketch below) | ~5–30 ms response |
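If the API does need aggregations or joins, a materialized view can precompute them so the endpoint reads a small result set; a minimal sketch (the view name and aggregation are illustrative, and MVs require Unity Catalog with a serverless or pro SQL warehouse):
-- Illustrative: precompute churn metrics so the API reads a small, ready-made result
CREATE MATERIALIZED VIEW gold.api.customer_churn_daily AS
SELECT
  DATE(updated_at)   AS as_of_date,
  COUNT(*)           AS customers,
  AVG(churn_score)   AS avg_churn_score
FROM gold.api.customer_summary
GROUP BY DATE(updated_at);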
Recommendations Summary:
Layer | Recommendation |
Data Storage | Delta Lake with Liquid Clustering (Auto mode) |
Query Engine | Databricks Photon-enabled SQL Warehouse (Serverless preferred) |
Acceleration | Delta Caching + Materialized Views |
Federation | Expose via Lakebase |
API Layer | Azure APIM + OAuth2 + Managed Identity |
Monitoring | Azure Monitor + App Insights + Query History |
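To back the Data Storage and API Layer rows above, table access can be limited to the API's service principal in Unity Catalog; a minimal sketch (the principal name api-service-principal is illustrative):
-- Grant only read access on the API table to the API service principal
GRANT USE CATALOG ON CATALOG gold TO `api-service-principal`;
GRANT USE SCHEMA ON SCHEMA gold.api TO `api-service-principal`;
GRANT SELECT ON TABLE gold.api.customer_summary TO `api-service-principal`;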
Sharp design choices, @rathorer! Appreciate you sharing this detailed architecture.
Thanks @Advika