Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Multi-tenant recommendation system (Machine learning)

Kasen
New Contributor III

Hello,

I am looking to build a multi-tenant machine learning recommender system in Azure Databricks. The idea is to have a single shared model architecture, where each tenant trains the same model on their own unique dataset. Essentially, while the model architecture remains the same for all tenants, the data used for training and inference would be specific to each one. Are there any resources I can refer to, or best practices for implementing such a system? Thank you!


Louis_Frolio
Databricks Employee

@Kasen, sorry for the delayed response. Here are some things to consider regarding your question.

 

Azure Databricks is well-suited for a shared-architecture, tenant-isolated recommender system. Below is a pragmatic blueprint, the isolation model options, and concrete best practices with Databricks-native services you can adopt.
 

Recommended multi-tenant architecture on Azure Databricks

  • Use Unity Catalog (UC) as the governance backbone with a single metastore per region and isolate tenants at the catalog or schema level (preferred over multiple metastores).
  • Bind catalogs and storage credentials to specific workspaces if you need environment isolation (e.g., dev vs prod and tenant-specific endpoints) while retaining centralized governance across the region.
  • Run shared compute safely with Lakeguard to enforce data governance at runtime on multi-user clusters and SQL warehouses; this lets you share cost-efficient compute without relaxing isolation controls.
  • For cost attribution and noisy-neighbor avoidance, prefer compute-per-tenant (dedicated job clusters or per-tenant serverless concurrency) even if data governance is centralized in UC.
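To make the catalog-per-tenant pattern above concrete, here is a minimal sketch that generates the Unity Catalog DDL for onboarding one tenant. The helper name, catalog/schema naming convention, and group name are illustrative assumptions, not a Databricks standard:

```python
# Hypothetical onboarding helper: one catalog per tenant, with schemas for
# features and models, and a grant to that tenant's user group.
def tenant_onboarding_sql(tenant_id: str, env: str = "prod") -> list[str]:
    catalog = f"{env}_{tenant_id}"  # assumed naming convention: <env>_<tenant>
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.features",
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.models",
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{tenant_id}_analysts`",
    ]

# Each statement would then be executed via spark.sql(...) on a UC-enabled cluster.
statements = tenant_onboarding_sql("acme")
```

Generating the DDL from a single helper keeps every tenant's layout identical, which matters when the same training code has to run against any tenant's catalog.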

Isolation controls and governance

  • Use catalog-per-tenant (preferred) or schema-per-tenant in a shared workspace; both patterns give strong isolation and simpler operations than workspace-per-tenant (which runs into the 250-workspace hard limit).
  • Apply workspace-catalog binding and credential binding to workspaces to constrain where production data is accessible and to segment endpoints and identities per environment or tenant.
  • Leverage row/column-level security and ABAC for finer-grained controls where needed; UC supports policy-based filtering and masking across governed tables.
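As a rough sketch of the row-level security bullet: a SQL UDF row filter can restrict a shared table to the caller's tenant via group membership. The function and table names below are hypothetical; `is_account_group_member` is the Databricks SQL built-in used in row filters:

```python
# Illustrative generator for a UC row filter: rows are visible only if the
# caller belongs to the group "tenant_<tenant_id>". Names are assumptions.
def row_filter_sql(table: str, tenant_col: str = "tenant_id") -> list[str]:
    return [
        # SQL UDF returning TRUE only for rows belonging to one of the caller's tenant groups
        f"""CREATE OR REPLACE FUNCTION main.security.tenant_filter({tenant_col} STRING)
RETURN is_account_group_member(concat('tenant_', {tenant_col}))""",
        # Bind the filter to the shared table
        f"ALTER TABLE {table} SET ROW FILTER main.security.tenant_filter ON ({tenant_col})",
    ]

statements = row_filter_sql("shared.sales.orders")
```

This is most useful for genuinely shared tables; with catalog-per-tenant, catalog-level grants already do most of the isolation work.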

Feature engineering and serving

  • Use Databricks Feature Store in Unity Catalog to register feature tables and models with governance, lineage, and cross-workspace discovery; training automatically tracks feature lineage, and inference can auto-lookup features to prevent training/serving skew.
  • For low-latency online inference, enable Online Feature Stores (Lakebase-powered) and publish per-tenant feature tables (latest values or full time series as needed).
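To illustrate the auto-lookup pattern in miniature: at inference time the serving layer fetches the latest features by key, so the client only sends entity IDs. The in-memory store below stands in for the real Feature Store client, and keying by (tenant, entity) is what prevents cross-tenant reads:

```python
# Toy online store keyed by (tenant, entity). In production this would be an
# Online Feature Store lookup, not a dict; the schema here is made up.
online_store = {
    ("acme", "user_42"): {"clicks_7d": 11, "purchases_30d": 2},
}

def lookup_features(tenant: str, user_id: str) -> dict:
    # The tenant is always part of the key, so one tenant can never
    # resolve another tenant's feature rows.
    return online_store.get((tenant, user_id), {})

assert lookup_features("acme", "user_42")["clicks_7d"] == 11
assert lookup_features("globex", "user_42") == {}  # no cross-tenant leakage
```

Because training reads the same feature tables the lookup serves from, the feature values seen at training and inference stay consistent, which is the skew-prevention point above.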

Model lifecycle per tenant

  • Keep a single model architecture (e.g., Two-Tower retrieval plus DLRM re-ranking) and register each tenant's model/version in UC under that tenant's catalog/schema using MLflow.
  • For scalable training, use TorchDistributor with Mosaic StreamingDataset (and TorchRec for sharded embeddings) to handle millions of users/items efficiently on multi-GPU clusters or serverless GPU.
  • If you're earlier in the journey, Databricks solution accelerators provide wide-and-deep, ALS, market-basket, and image-similarity notebooks to bootstrap tenant builds on a common codebase.
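One small but important mechanic in the per-tenant registration bullet: UC-backed MLflow requires three-level registered-model names (`catalog.schema.model`). A sketch, with the catalog-per-tenant convention and model name as assumptions:

```python
# Hypothetical helper: same architecture for everyone, one registered model
# per tenant inside that tenant's catalog. "models" schema is an assumption.
def tenant_model_name(tenant_catalog: str, model: str = "two_tower_recsys") -> str:
    return f"{tenant_catalog}.models.{model}"

# Used with MLflow roughly like (not executed here):
#   mlflow.set_registry_uri("databricks-uc")
#   mlflow.pytorch.log_model(model, "model",
#                            registered_model_name=tenant_model_name("prod_acme"))
name = tenant_model_name("prod_acme")
```

Keeping the name derivation in one function means the training job, the serving config, and the monitoring queries all agree on where each tenant's weights live.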

Inference, A/B testing, and monitoring

  • Serve tenant models with Mosaic AI Model Serving. You can either deploy one endpoint per tenant or use a multi-model endpoint (served_entities) with traffic splitting to route per-tenant traffic or run challenger vs. current for A/B tests.
  • For high-QPS/low-latency tenants, enable route optimization (dedicated URL + OAuth) to reduce overhead latency and raise QPS versus standard endpoints.
  • Turn on AI Gateway usage tracking and inference tables for each endpoint to log requests/responses to a UC Delta table for evaluation, drift monitoring, and corpus creation for fine-tuning or re-rankers.
  • Apply rate limits (endpoint, user, group) to protect shared capacity across tenants; monitor limits and regions with the Serving limits/regions guide.
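A sketch of the challenger-vs-current split mentioned above, building an endpoint config with two served entities and a validated traffic split. The field names mirror the Model Serving API shape (`served_entities`, `traffic_config`), but treat the exact payload, versions, and labels as assumptions and check the current docs before use:

```python
# Hypothetical config builder for an A/B split between model version 1
# ("current") and version 2 ("challenger") of one tenant's registered model.
def ab_config(model_name: str, current_pct: int, challenger_pct: int) -> dict:
    routes = {"current": current_pct, "challenger": challenger_pct}
    if sum(routes.values()) != 100:
        raise ValueError("traffic_percentage values must sum to 100")
    return {
        "served_entities": [
            {"name": label, "entity_name": model_name, "entity_version": version}
            for version, label in enumerate(routes, start=1)
        ],
        "traffic_config": {
            "routes": [
                {"served_model_name": label, "traffic_percentage": pct}
                for label, pct in routes.items()
            ]
        },
    }

cfg = ab_config("prod_acme.models.two_tower_recsys", 90, 10)
```

Validating that the split sums to 100 in code catches a misconfigured rollout before it ever reaches the serving API.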

Cross-region or cross-organization sharing

  • Keep one UC metastore per region; share data across regions/orgs with Databricks-to-Databricks Delta Sharing (foreign catalogs), noting that lineage/ACLs don't cross the share boundary and must be re-applied in the recipient.
  • If you need governed open sharing to external tools (e.g., Power BI), use OIDC federation for Delta Sharing to avoid long-lived bearer tokens and retain MFA/IdP policy enforcement.
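For the Delta Sharing path, the provider-side setup is a handful of SQL statements: create a share, add tables, grant it to a recipient. A small generator in the same spirit as the earlier helpers (share/recipient names are placeholders):

```python
# Illustrative Delta Sharing provider-side DDL for sharing one tenant table
# cross-region. Share and recipient names are assumptions.
def share_sql(share: str, table: str, recipient: str) -> list[str]:
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD TABLE {table}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]

statements = share_sql("acme_eu_share", "prod_acme.features.user_events", "acme_eu_region")
```

Remember the caveat above: grants and lineage do not travel with the share, so the recipient side still has to re-apply its own ACLs on the mounted foreign catalog.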

Cost, quotas, and limits

  • Treat compute as the attribution layer (per-tenant clusters/concurrency), and use serverless budget policies and tags for granular billing.
  • Review UC quotas and request increases if needed (e.g., large numbers of catalogs, tables, or models per tenant) with the UC quota SOP.
  • Check Model Serving limits (QPS, payload, concurrency, compliance) and route optimization requirements when designing endpoints at scale.

External access patterns and guardrails

  • Avoid external systems writing to the same tables outside Databricks, as UC doesn't govern direct object-store writes; use managed tables or explicit external-volume patterns and credential vending to preserve consistency and security.

Concrete blueprint (step-by-step)

  • Identity and governance: Provision principals via SCIM at the account level, enable UC, create a catalog per tenant, and bind catalogs/credentials to the correct workspaces and environments (dev/stg/prod).
  • Data ingestion and isolation: Land each tenantโ€™s data into their catalog/schema, applying RLS/CLS or ABAC where needed; use Lakeguard on shared compute clusters to enforce governance at runtime.
  • Feature engineering: Build tenant feature tables in UC, track lineage, and publish hot features to Online Feature Stores for low-latency inference.
  • Model training: Use common repos/notebooks with TorchDistributor/Mosaic Streaming for Two-Tower retrieval and DLRM re-ranking; register each tenant's model in UC (same architecture, different weights), tracked by MLflow.
  • Model serving: Create per-tenant endpoints or multi-model endpoints with traffic split and route optimization; enable AI Gateway usage tracking, rate limits, and inference tables for monitoring and A/B testing.
  • Cross-region access (optional): Use D2D Delta Sharing and re-grant ACLs in the recipient catalog; don't attempt cross-region metastore assignment.

Resources to read and use

  • What is Unity Catalog and Azure UC best practices (metastore per region, isolation at catalog/schema, workspace binding).
  • Isolation in Multi-Tenant Applications (catalog/schema vs. workspace per tenant; compute-per-tenant guidance).
  • Unity Catalog Lakeguard overview for multi-user governance on shared compute.
  • Feature Store in UC and Online Feature Stores (setup, auto feature lookup, online serving patterns).
  • Model Serving docs: create endpoints, multi-model traffic splitting, route optimization, usage tracking, inference tables, limits/regions.
  • Delta Sharing architecture and OIDC federation (cross-region/org data sharing patterns).
  • Recommender systems on Databricks: Two-Tower, DLRM, wide-and-deep, ALS, accelerators and blogs.
 
Hope this helps, Louis.
