In this blog, I will walk through an architecture for serving API consumption on the Databricks Platform using a Lakebase-based approach, and show why it fits this kind of API requirement.
API Requirements: performance, scalability, availability, security, and monitoring & observability.
This assumes a data processing pipeline is already in place to process the data, either in near real time (NRT) or through batch processing.
This high-level flow assumes the data platform has already completed all processing, whether NRT or batch. The staging table could be a gold table, a materialized view, or a live table. A dedicated API table is created so that the API never hits any of the base tables, and it can reside in a separate subnet if required by security policy. It can be populated directly from the Silver layer, with access restricted to the API-specific user on this table. This final API table will hold millions of records; a minimal population sketch follows.
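As an illustration, the API table could be refreshed with a MERGE from the Silver layer; the source table name (silver.crm.customer_summary) and the incremental filter below are assumptions, not part of the original design:
-- Illustrative refresh of the API table from an assumed Silver source
MERGE INTO gold.api.customer_summary AS tgt
USING (
  SELECT id, customer_id, cen_id, churn_score, business_units, updated_at
  FROM silver.crm.customer_summary                     -- hypothetical Silver source table
  WHERE updated_at >= current_timestamp() - INTERVAL 1 DAY
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;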
Process Flow:
Final Delta API table (approx. millions of records)
↓
Lakehouse Federation + Lakebase (Powered by Delta Caching + Photon + Materialized Views)
↓
APIM (OAuth2 + Managed Identity) → APIs (served via Lakebase endpoint)
The final (Gold) table should live in Unity Catalog (UC):
USE CATALOG gold;
USE SCHEMA api;
-- Create or Upgrade to a Delta table optimized for point lookups
CREATE TABLE IF NOT EXISTS customer_summary
(
id STRING NOT NULL, -- primary lookup key for API
customer_id STRING,
cen_id STRING,
churn_score DOUBLE,
business_units ARRAY<STRING>,
updated_at TIMESTAMP
)
USING DELTA
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/gold/api/customer_summary';
-- Enable auto-optimization & liquid clustering
ALTER TABLE customer_summary SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.feature.liquidClustering' = 'supported',
'delta.liquidClustering.autoMode.enabled' = 'true'
);
-- Optional: Bloom filter index for high-selectivity key lookups
CREATE BLOOMFILTER INDEX
ON TABLE customer_summary
FOR COLUMNS(id) OPTIONS (fpp = 0.05, numItems = 30000000);
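After loads, periodic maintenance keeps point lookups fast; a minimal sketch, assuming it runs from a scheduled Databricks job:
-- Compact small files and (re)cluster the table, then refresh column statistics
OPTIMIZE gold.api.customer_summary;
ANALYZE TABLE gold.api.customer_summary COMPUTE STATISTICS FOR ALL COLUMNS;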
APIM continues to validate the OAuth2 token from consumers (as in the current setup).
Add retry and circuit-breaker handling at APIM:
<policies>
    <inbound>
        <base />
        <validate-jwt header-name="Authorization" failed-validation-httpcode="401" require-scheme="Bearer">
            <openid-config url="https://login.microsoftonline.com/<tenant-id>/v2.0/.well-known/openid-configuration" />
            <audiences>
                <audience>api://your-api-app-id</audience>
            </audiences>
        </validate-jwt>
    </inbound>
    <backend>
        <!-- Retry transient backend failures; circuit breaking is configured on the
             APIM backend resource (circuitBreaker rules), not as an inline policy -->
        <retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)" count="2" interval="1" first-fast-retry="true">
            <forward-request />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
    <on-error>
        <base />
    </on-error>
</policies>
Use Managed Identity to get an AAD token and connect with the databricks-sql-connector (or ODBC) over Private Link to the SQL Warehouse:
# requirements.txt
# azure-identity==1.15.0
# databricks-sql-connector==3.1.0
# tenacity==8.2.3
import os

from azure.identity import ManagedIdentityCredential
from databricks import sql
from tenacity import retry, stop_after_attempt, wait_exponential

DATABRICKS_SQL_ENDPOINT = os.environ["DBSQL_ENDPOINT"]  # e.g. adb-<id>.<region>.azuredatabricks.net
WAREHOUSE_ID = os.environ["DBSQL_WAREHOUSE_ID"]
CATALOG = "gold"
SCHEMA = "api"

# Get an AAD access token for Databricks via the Managed Identity
def aad_token():
    cred = ManagedIdentityCredential()
    # Scope built from the Azure Databricks AAD resource ID
    token = cred.get_token("2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default")
    return token.token

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=0.1, min=0.2, max=1))
def get_customer(customer_id: str):
    with sql.connect(
        server_hostname=DATABRICKS_SQL_ENDPOINT,
        http_path=f"/sql/1.0/warehouses/{WAREHOUSE_ID}",
        access_token=aad_token(),  # AAD token from the managed identity
        _user_agent_entry="api-lakebase",
    ) as conn:
        with conn.cursor() as cur:
            # The connector runs one statement per call, so set catalog and schema separately
            cur.execute(f"USE CATALOG {CATALOG}")
            cur.execute(f"USE SCHEMA {SCHEMA}")
            cur.execute(
                "SELECT id, customer_id, cen_id, churn_score, business_units, updated_at "
                "FROM customer_summary WHERE id = :id LIMIT 1",
                {"id": customer_id},
            )
            row = cur.fetchone()
            return None if row is None else dict(zip([c[0] for c in cur.description], row))
Key tuning levers: data layout, the SQL warehouse, caching, query design, and the API level.
Monitoring spans both the Azure side and the Databricks side.
KQL sketch (APIM)
AzureDiagnostics
| where ResourceType == "APIM"
| summarize p50=percentile(DurationMs,50), p95=percentile(DurationMs,95), errors=countif(ResponseCode >= 500) by bin(TimeGenerated, 5m)
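On the Databricks side, a comparable latency view can be sketched against the query history system table; this assumes system tables are enabled, and the column names may differ by release:
-- Illustrative: p50/p95 latency and failure count for API lookups, hourly
-- (schema of system.query.history assumed; verify column names in your workspace)
SELECT
  date_trunc('HOUR', start_time)          AS window_start,
  percentile(total_duration_ms, 0.50)     AS p50_ms,
  percentile(total_duration_ms, 0.95)     AS p95_ms,
  count_if(execution_status = 'FAILED')   AS errors
FROM system.query.history
WHERE statement_text LIKE '%customer_summary%'
GROUP BY 1;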
Optimization Summary:
Optimization | Read Throughput (approx.) |
Photon + Delta Cache | 10,000+ QPS (primary-key lookups) |
Liquid Clustering enabled | Consistent <100 ms responses |
Materialized View + cache (only if the API needs aggregations or joins; see the sketch below) | ~5–30 ms response |
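If the API does need aggregations or joins, a materialized view can precompute them so the endpoint reads a small result set; a minimal sketch (the view name and aggregation are illustrative, and MVs require Unity Catalog with a serverless or pro SQL warehouse):
-- Illustrative: precompute churn metrics so the API reads a small, ready-made result
CREATE MATERIALIZED VIEW gold.api.customer_churn_daily AS
SELECT
  DATE(updated_at)   AS as_of_date,
  COUNT(*)           AS customers,
  AVG(churn_score)   AS avg_churn_score
FROM gold.api.customer_summary
GROUP BY DATE(updated_at);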
Recommendations Summary:
Layer | Recommendation |
Data Storage | Delta Lake with Liquid Clustering (Auto mode) |
Query Engine | Databricks Photon-enabled SQL Warehouse (Serverless preferred) |
Acceleration | Delta Caching + Materialized Views |
Federation | Expose via Lakebase |
API Layer | Azure APIM + OAuth2 + Managed Identity |
Monitoring | Azure Monitor + App Insights + Query History |
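To back the Data Storage and API Layer rows above, table access can be limited to the API's service principal in Unity Catalog; a minimal sketch (the principal name api-service-principal is illustrative):
-- Grant only read access on the API table to the API service principal
GRANT USE CATALOG ON CATALOG gold TO `api-service-principal`;
GRANT USE SCHEMA ON SCHEMA gold.api TO `api-service-principal`;
GRANT SELECT ON TABLE gold.api.customer_summary TO `api-service-principal`;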
Sharp design choices, @rathorer! Appreciate you sharing this detailed architecture.
Thanks @Advika