
API Consumption on Databricks

rathorer
New Contributor III

In this blog, I will walk through building an architecture to serve API consumption on the Databricks Platform, using a Lakebase-based approach. It is well suited to the kind of API requirements outlined below.

API Requirements:

Performance:

  • Current Response Time: 20–30 ms with average 14K calls/day.
  • Target Response Time (Future): Maintain <100 ms even at peak load of 70–80K calls/day with full roll-out.

Scalability:

  • Must support a 4–5x increase in daily call volume

Availability:

  • Maintain 99.9% uptime
  • Implement retry logic and circuit breakers for resilience

Security:

  • Continue using OAuth 2.0 client credentials flow for consumers
  • Ensure APIM uses Managed Identity for backend access
  • Enforce IP restrictions and private endpoints

Monitoring & Observability:

  • Track latency, error rates, and throughput via Azure Monitor / App Insights
  • Set up alerts for spikes or failures

There is an assumption that a data processing pipeline is already in place to process the data, either in near real time (NRT) or through batch processing.

rathorer_0-1759990971452.png

This is the assumed high-level flow: the data platform has already done all of its processing, whether NRT or batch. The staging table could be a Gold table, a materialized view, or a live table. A dedicated API table is created so that the API never hits any of the base tables, and it can reside in a separate subnet if required by security policy. It can be populated directly from the Silver layer, with security enforced by granting only an API-specific user access to this table (a minimal grant sketch follows below). This final API table would hold millions of records.
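A minimal grant sketch for that access model, assuming a hypothetical API-only principal named api_reader_sp (the catalog, schema, and table names follow the example used later in this post):

-- Grant the API identity read-only access to the API table and nothing else
-- (`api_reader_sp` is a placeholder for your Entra ID service principal or group)
GRANT USE CATALOG ON CATALOG gold TO `api_reader_sp`;
GRANT USE SCHEMA ON SCHEMA gold.api TO `api_reader_sp`;
GRANT SELECT ON TABLE gold.api.customer_summary TO `api_reader_sp`;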

Process Flow: 

Final Delta API Table (Approx Million(s) records)

                     ↓

Lakehouse Federation + Lakebase (Powered by Delta Caching + Photon + Materialized Views)

                     ↓

APIM (OAuth2 + Managed Identity) → APIs (served via Lakebase endpoint)

 

rathorer_0-1760078458113.png

rathorer_1-1760078513459.png

A. Prereqs (one-time)

  1. Unity Catalog (UC):
  • Create a metastore and attach your workspace.
  • Set up an external location for ADLS and a storage credential.
  • Create catalogs: one for the API table (since it resides in a separate subnet) and another for all other Databricks processing (see the sketch after these prereqs).
  2. Networking (private only):
  • Private Endpoint to ADLS Gen2.
  • Private Link to Databricks workspace and Databricks SQL (Serverless) endpoints.
  • Keep VNET peering between Data VNET and Integration VNET (APIM).
  • Lock down egress with NSGs / UDRs if required.

  3. Identity:
  • System-assigned Managed Identity on Azure Function/AKS and APIM.
  • Create Entra ID App (service principal) for Databricks SQL access or use MI → Entra ID → Databricks OBO (recommended).
  • In Databricks, create a user for that identity (SCIM) and grant least-privilege.
  4. Databricks SQL Warehouse (Serverless):
  • Photon: ON
  • Min/Max clusters: start 2–4 (autoscale), concurrency 20–50
  • Channel: Current
  • Spot: OK for dev, not for prod
  • Enable Serverless Result Cache (default)
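A minimal SQL sketch of the one-time UC objects from item 1, assuming a storage credential named adls_cred already exists (created from an Azure access connector) and reusing the ADLS placeholders from the table DDL below:

-- External location over the ADLS container (storage credential assumed to exist)
CREATE EXTERNAL LOCATION IF NOT EXISTS gold_api_loc
URL 'abfss://<container>@<account>.dfs.core.windows.net/gold'
WITH (STORAGE CREDENTIAL adls_cred);

-- One catalog for the API table, another for the rest of the Databricks processing
CREATE CATALOG IF NOT EXISTS gold
MANAGED LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/gold';
CREATE CATALOG IF NOT EXISTS processing;

CREATE SCHEMA IF NOT EXISTS gold.api;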

B. Lakehouse tables (Final Databricks table)

Final (Gold) table should live in UC:

USE CATALOG gold;
USE SCHEMA api; 

-- Create or Upgrade to a Delta table optimized for point lookups
CREATE TABLE IF NOT EXISTS customer_summary
(
  id STRING NOT NULL,                     -- primary lookup key for API
  customer_id STRING,
  cen_id STRING,
  churn_score DOUBLE,
  business_units ARRAY<STRING>,
  updated_at TIMESTAMP
)
USING DELTA
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/gold/api/customer_summary'; 

-- Enable auto-optimization
ALTER TABLE customer_summary SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

-- Enable liquid clustering on the lookup key
-- (or ALTER TABLE customer_summary CLUSTER BY AUTO; if predictive optimization is enabled)
ALTER TABLE customer_summary CLUSTER BY (id);

-- Optional: Bloom filter index for high-selectivity key lookups
CREATE BLOOMFILTER INDEX
ON TABLE customer_summary
FOR COLUMNS(id OPTIONS (fpp = 0.05, numItems = 30000000));

C. Connecting APIM/Function to Databricks SQL (Managed Identity, Private Link)

1) APIM Policy (OAuth2 client credentials)

APIM continues to validate the OAuth2 token from consumers (as you have now). Add retries in the policy, and configure circuit-breaker rules on the APIM backend resource for resilience:

<policies>
  <inbound>
    <base />
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401" require-scheme="Bearer">
      <openid-config url="https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration" />
      <audiences>
        <audience>api://your-api-app-id</audience>
      </audiences>
    </validate-jwt>
  </inbound>
  <backend>
    <!-- Retry transient backend failures. Circuit breaking is configured as
         circuit-breaker rules on the APIM backend resource, not as an inline policy. -->
    <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="2" interval="1" first-fast-retry="true">
      <forward-request buffer-request-body="true" />
    </retry>
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

2) Azure Function (Python) → Databricks SQL with MSI

Use Managed Identity to get an AAD token and connect with the databricks-sql-connector (or ODBC) over Private Link to the SQL Warehouse:

# requirements.txt
# azure-identity==1.15.0
# databricks-sql-connector==3.1.0
# tenacity==8.2.3

import os

from azure.identity import ManagedIdentityCredential
from databricks import sql
from tenacity import retry, stop_after_attempt, wait_exponential

DATABRICKS_SQL_ENDPOINT = os.environ["DBSQL_ENDPOINT"]   # e.g. adb-<id>.<region>.azuredatabricks.net
WAREHOUSE_ID            = os.environ["DBSQL_WAREHOUSE_ID"]
CATALOG                 = "gold"
SCHEMA                  = "api"

def aad_token():
    """Get an AAD access token for Databricks using the Managed Identity."""
    cred = ManagedIdentityCredential()
    # 2ff814a6-... is the AAD resource ID for Azure Databricks
    token = cred.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default")
    return token.token

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.1, min=0.2, max=1))
def get_customer(customer_id: str):
    with sql.connect(
        server_hostname=DATABRICKS_SQL_ENDPOINT,
        http_path=f"/sql/1.0/warehouses/{WAREHOUSE_ID}",
        access_token=aad_token(),            # AAD token from MSI
        catalog=CATALOG,                     # default catalog/schema for the session
        schema=SCHEMA,
        _user_agent_entry="api-lakebase",
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, customer_id, cen_id, churn_score, business_units, updated_at "
                "FROM customer_summary WHERE id = :id LIMIT 1",
                {"id": customer_id},
            )
            row = cur.fetchone()
            return None if row is None else dict(zip([c[0] for c in cur.description], row))

3) Meeting API SLOs

Throughput math

70–80K calls/day ≈ 1 req/sec sustained, with bursts of roughly 5–20 req/sec.

  • Serverless SQL + Photon easily handles tens to hundreds of QPS for point lookups on a well-clustered Delta table.

Latency (<100 ms target)

  • Photon + liquid clustering + result cache → 10–50 ms typical for single-row lookups.
  • Keep payloads small (project only needed columns).
  • Warm the result cache with a tiny “health” query every minute (sketch below).
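A sketch of such a warm-up query; the exact statement is an assumption, and anything cheap that touches the table will do:

-- Run every minute (e.g. from the Function or a scheduled Databricks job)
-- to keep the warehouse and result cache warm
SELECT max(updated_at) AS last_refresh
FROM gold.api.customer_summary;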

Availability (99.9%)

  • Serverless SQL Warehouses provide high availability.
  • Add APIM retry + circuit breaker (above) and Function retry (Tenacity).
  • Provision min 2 clusters for steady-state capacity; allow autoscale to 4–8.

4) Performance & process improvements (the knobs)

Data layout

  • Liquid Clustering (auto) ON (you already chose this).
  • Small-file compaction: nightly OPTIMIZE (sketch after this list).
  • Bloom filter on id (optional but helpful for point lookups).
  • Target file size: 128–256 MB (Delta optimizer manages this with auto-compact).
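A minimal sketch of that nightly compaction job; the VACUUM retention is an assumption, so check it against your time-travel and recovery requirements:

-- Nightly maintenance on the API table (with liquid clustering, no ZORDER BY is needed)
OPTIMIZE gold.api.customer_summary;

-- Optional: remove files older than the default 7-day retention
VACUUM gold.api.customer_summary RETAIN 168 HOURS;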

Warehouse

  • Photon: ON (critical).
  • Serverless preferred (lower cold-starts, managed).
  • Concurrency: Start 20–50.
  • Autoscale: min 2 / max 4 (raise if you see queueing).
  • Result cache: default ON; don’t disable.
  • Channel: Current.
  • Keep-alive pings from Function every few minutes to reduce cold-starts.

Caching

  • Delta cache (on compute) helps when the same partitions hit frequently.
  • For strictly random single-key lookups, rely on Photon + metadata pruning + result cache.

Query design

  • Use a parameterized SELECT … WHERE id = :id LIMIT 1 (as in the Function code above).
  • Avoid SELECT *; project only the columns you need.
  • Keep predicates SARGable (no function wrapped around the lookup column), as shown below.
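A small illustration of the SARGable point, reusing the example table (the second query is shown only as the anti-pattern to avoid):

-- Good: the clustered column is compared directly, so file pruning applies
SELECT id, churn_score FROM gold.api.customer_summary WHERE id = 'C-12345';

-- Bad: wrapping the column in a function defeats pruning and forces a wider scan
SELECT id, churn_score FROM gold.api.customer_summary WHERE upper(id) = 'C-12345';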

API level

  • Tight timeouts (e.g., 1–2 s) and retries 2x with backoff.
  • Bulk endpoints (optional): allow querying up to N ids per call to reduce chattiness (sketch below).
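A sketch of the query shape behind such a bulk endpoint; the id values are placeholders bound from the request body, and the per-call cap is a design choice:

-- One round trip for N ids instead of N single-id calls
SELECT id, customer_id, cen_id, churn_score, business_units, updated_at
FROM gold.api.customer_summary
WHERE id IN ('C-001', 'C-002', 'C-003');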

5) Security (Existing, pointing to Databricks)

  • OAuth 2.0 client credentials at APIM (unchanged).
  • Managed Identity from APIM/Function to Databricks SQL (token exchange via Entra ID).
  • Private Endpoints to:
    • ADLS Gen2
    • Databricks Workspace
    • Databricks SQL endpoint (new)
  • IP restrictions at APIM and Function subnets.
  • UC RBAC on the gold.api.customer_summary table (least privilege).
  • No public ingress to data plane.

6) Monitoring & observability

Azure side

  • App Insights on Function/AKS: track latency, p50/p95, error %, dependency calls to Databricks.
  • APIM Analytics: backend latency, 4xx/5xx, throttle events.
  • Alerts: 5xx spike, latency p95 > target, throttling > threshold.

Databricks side

  • SQL Warehouse → “Query history” & “Statement execution” metrics; set alerts for queue time, duration.
  • System tables (UC): system.query.history, system.access.audit (sample query after this list).
  • Diagnostic settings streaming to Log Analytics (KQL dashboards).
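A sketch against the system.query.history table mentioned above, assuming a single warehouse serves the API (filter further by warehouse ID or statement text as needed; column names may vary by release, so verify against your workspace):

-- p50/p95 latency and error count for statements over the last hour
SELECT
  percentile(total_duration_ms, 0.50) AS p50_ms,
  percentile(total_duration_ms, 0.95) AS p95_ms,
  count_if(execution_status = 'FAILED') AS failed_statements
FROM system.query.history
WHERE start_time >= current_timestamp() - INTERVAL 1 HOUR;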

KQL sketch (APIM)

AzureDiagnostics
| where ResourceType == "APIM"
| summarize p50=percentile(DurationMs,50), p95=percentile(DurationMs,95), errors=countif(ResponseCode >= 500) by bin(TimeGenerated, 5m)

Optimization Summary:

  • Photon + Delta Cache → 10,000+ QPS (approx read throughput) for primary-key lookups
  • Liquid Clustering enabled → consistent <100 ms responses
  • Materialized View + cache → ~5–30 ms responses (enable an MV only if the API needs aggregations or joins; sketch below)
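A hedged sketch of such a materialized view, assuming a hypothetical per-customer aggregation (adjust the query and refresh cadence to what the API actually needs):

-- Worthwhile only when the API serves aggregations/joins rather than single-row lookups
CREATE MATERIALIZED VIEW gold.api.customer_bu_summary AS
SELECT customer_id,
       count(*)         AS record_count,
       max(churn_score) AS max_churn_score,
       max(updated_at)  AS last_updated
FROM gold.api.customer_summary
GROUP BY customer_id;

-- Refresh on your pipeline's cadence (can also be scheduled)
REFRESH MATERIALIZED VIEW gold.api.customer_bu_summary;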

 

Recommendations Summary:

  • Data Storage → Delta Lake with Liquid Clustering (Auto mode)
  • Query Engine → Databricks Photon-enabled SQL Warehouse (Serverless preferred)
  • Acceleration → Delta Caching + Materialized Views
  • Federation → Expose via Lakebase
  • API Layer → Azure APIM + OAuth2 + Managed Identity
  • Monitoring → Azure Monitor + App Insights + Query History

 
