JianWu, Databricks Employee

The Machine Learning Development Life Cycle (MLDLC) involves several phases, from data collection to model deployment and monitoring. Unity Catalog (UC) is a powerful tool that can help streamline and govern each step in this process, ensuring that data is well organized, accessible, and traceable throughout the lifecycle. This blog outlines how Unity Catalog can be used in each phase of the MLDLC, covering design considerations, best practices, and supporting capabilities such as lineage tracking, asset discovery, user collaboration, and security auditing.

 

What is the Machine Learning Development Life Cycle (MLDLC)?

The MLDLC encompasses all stages of machine learning, from initial data collection and preprocessing to model deployment and monitoring. The key phases include:

  1. Data Ingestion and Preparation
  2. Feature Engineering
  3. Model Training and Tuning
  4. Model Deployment and Monitoring

Each phase is crucial in developing accurate and scalable machine learning models. Unity Catalog can assist in managing data, ensuring consistency, and improving collaboration across teams during all phases of the MLDLC.

 

What is Unity Catalog (UC)?

Unity Catalog (link) is a centralized metadata service that helps govern data access and ensure consistency across various Databricks workspaces. It provides a single control plane to manage, discover, and track data assets across multiple environments, helping organizations ensure that their data and model pipelines are efficient, secure, and compliant.

With Unity Catalog, organizations can manage data and AI assets in a structured, scalable manner. This organization of data and AI assets helps locate and access resources quickly and provides the governance needed to comply with data protection regulations and internal policies. Efficient access and scalability in Unity Catalog allow data scientists, engineers, and business stakeholders to focus on use case development rather than searching for data, ultimately saving time and improving productivity.

Below is a high-level mapping of Unity Catalog features to each phase of the ML development lifecycle.


Figure 1: Unity Catalog for ML Development Lifecycle

 

How Unity Catalog Helps in Each Phase of the MLDLC

Design Considerations

Before implementing Unity Catalog in your ML development pipeline, it’s essential to understand the broader design principles that influence how you organize data and AI assets. Unity Catalog’s 3-level namespace (link) — catalog > schema > table/volume/model — provides a flexible way to support varied use cases, teams, and security requirements. Below are some universal considerations to help guide the design:

General Guidelines 

  • Team Size
    • Large teams may need finer-grained organization, such as a schema per business unit or project, while smaller teams may share schemas like bronze, silver, and gold.
  • Complexity of Projects
    • For projects with many models or advanced ML needs, dedicate a schema per project or use approaches such as a wrapper model to manage multiple models under one abstraction.
  • Access Levels and Permissions
    • Not all users need access to everything. Unity Catalog allows you to define access control at the catalog, schema, or asset level to ensure secure usage.
  • Models, Functions, and Features
    • Treat models as first-class citizens in Unity Catalog — stored alongside functions and feature tables. Use aliases like “Champion” and “Challenger” to control deployment without overwriting versions.
  • Discoverability
    • Use tags and consistent naming conventions to help teams discover relevant assets. This is especially useful in cross-functional teams and large-scale environments.

 

Applying the Guidelines: Example Design for Fraud Detection

To bring the principles above into focus, here’s an example of how you could apply them in a fraud detection ML use case (a SQL sketch of this layout follows the list):

  1. Catalog Level 
    • Environment Segregation: Use separate catalogs for dev, staging, and prod to maintain a clear separation between experimental, testing, and production data.
  2. Schema Level
    • Medallion Architecture: Within each catalog, define Bronze (raw data), Silver (cleaned data), and Gold (enriched/feature) schemas to ensure data transformation follows a structured pipeline. A use case-specific schema like fraud_detection can store all machine learning assets relevant to fraud detection, providing modularity and reusability.
  3. Feature Table
    • Centralized and Reusable: Store commonly used feature tables such as user_features and product_features under the Gold schema, allowing multiple ML models to reuse them.
  4. Model
    • Versioning and Governance: Register ML models (e.g., fraud_clf for a fraud classifier) in the fraud_detection schema, enabling version control and controlled deployment.
  5. Volume
    • Handling Non-Tabular Data: Store unstructured data (e.g., images, text logs) in Volumes under the appropriate schema, ensuring they are accessible for feature engineering.
  6. Function
    • On-Demand Feature Computation: Define functions like compute_distance in the fraud_detection schema, enabling models to calculate real-time distance-based features.
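
Taken together, a minimal sketch of creating this layout in a Databricks notebook might look like the following; the catalog, schema, and volume names are illustrative, not prescriptive:

```python
# Create the environment catalogs and the medallion + use-case schemas.
for catalog in ("dev", "staging", "prod"):
    spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")
    for schema in ("bronze", "silver", "gold", "fraud_detection"):
        spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")

# A managed volume for non-tabular data (e.g., images, text logs).
spark.sql("CREATE VOLUME IF NOT EXISTS dev.fraud_detection.raw_logs")
```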


Figure 2: Sample catalog organization

 

Data Ingestion and Preparation

Unity Catalog plays a central role in data ingestion by ensuring raw data assets are easily discoverable, well organized, and protected with proper access controls, so that data scientists and engineers can quickly find and access the data they need for preprocessing.

Organizing Data and AI Assets: Unity Catalog's structure helps ensure that data tables and AI models are logically grouped under different catalogs and schemas, making it easy for teams to find specific datasets and models they need. You can define business_unit_dev, staging, and prod catalogs to store different environments and avoid conflicts between datasets in development and those in production.

Tabular and Non-tabular Data Management: In UC, tables govern tabular data and volumes (link) govern non-tabular data (such as images or text), while schemas group related objects (tables, volumes, even models) together.

Access Controls and Permissions: UC provides granular access control (link) at the catalog, schema, and table levels, helping ensure that only authorized users can access sensitive datasets. This also supports compliance with data privacy and security regulations, as organizations can easily set access policies for different datasets and ML assets.
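
As a hedged sketch, such access policies can be expressed directly in SQL; the group names here (`data-scientists`, `fraud-ml-team`) are hypothetical:

```python
# Read-only access to curated data for the data-science group.
spark.sql("GRANT USE CATALOG ON CATALOG prod TO `data-scientists`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `data-scientists`")

# Broader rights on the use-case schema for the owning team.
spark.sql("GRANT ALL PRIVILEGES ON SCHEMA prod.fraud_detection TO `fraud-ml-team`")
```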

 

Feature Engineering

As we move to the Feature Engineering (link) phase, Unity Catalog continues to provide key benefits. Feature tables in UC are designed to serve as the foundation for training models. A feature table is a dataset created explicitly for model training, containing relevant features that help improve the performance of ML models.

Structure of Feature Tables: UC allows teams to create versioned feature tables for different stages of the ML pipeline. For example, the fraud detection use case (in the diagram above) may require multiple feature tables such as user_features, product_features, and fraud_detection_features. These feature tables can be stored in the gold schema to signify that they are refined and ready for model training.
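
As a minimal sketch, assuming the databricks-feature-engineering package available on Databricks ML runtimes and a hypothetical DataFrame user_features_df keyed by user_id, registering such a feature table could look like this:

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# Register a reusable feature table in the gold schema.
fe.create_table(
    name="prod.gold.user_features",
    primary_keys=["user_id"],
    df=user_features_df,  # hypothetical DataFrame of per-user aggregates
    description="Per-user aggregates reused across fraud models",
)
```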

Functionality of Functions: UC allows using Python UDFs to compute features on demand. For example, a compute_distance function could be created for the fraud detection use case to calculate location-based features at inference time. Functions in Unity Catalog can thus provide real-time or batch computations critical for dynamic feature engineering.
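
A sketch of such a function as a Unity Catalog Python UDF follows; the haversine formula is just one illustrative distance-based feature:

```python
spark.sql("""
CREATE OR REPLACE FUNCTION prod.fraud_detection.compute_distance(
    lat1 DOUBLE, lon1 DOUBLE, lat2 DOUBLE, lon2 DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
AS $$
import math
# Haversine distance in kilometers between two coordinates.
r = 6371.0
p1, p2 = math.radians(lat1), math.radians(lat2)
dp = math.radians(lat2 - lat1)
dl = math.radians(lon2 - lon1)
a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
return 2 * r * math.asin(math.sqrt(a))
$$
""")
```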

 

Model Training and Tuning

After completing data transformation and feature engineering, the next phase involves selecting suitable algorithms, training models, and tuning hyperparameters. This process requires iterative experimentation to optimize model performance, at which point model management and deployment become critical considerations.

Model management

Unity Catalog delivers a unified solution for machine learning model management by providing a hosted MLflow Model Registry that supports model registration, versioning, and flexible deployment through model aliases, making it easier to reference and promote models across environments (link).

Centralized access control and governance at the metastore level ensure that only authorized users can access or modify registered models, with consistent permissions applied across all workspaces (link).

Using three-level namespaces, Unity Catalog enables organizations to either create separate registered models for different environments or seamlessly promote a single model across development, staging, and production (link), all while maintaining robust governance (link).

Its native integration with Databricks Feature Store further streamlines the process by tracking features used in model training and ensuring they are readily available for inference, thereby simplifying model deployment and updates.
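
A minimal sketch of this flow with the MLflow client; the run ID and model name are placeholders:

```python
import mlflow
from mlflow import MlflowClient

# Use Unity Catalog as the MLflow model registry.
mlflow.set_registry_uri("databricks-uc")

# Register a model logged in an MLflow run under a three-level name.
mv = mlflow.register_model("runs:/<run_id>/model", "prod.fraud_detection.fraud_clf")

# Point the "Challenger" alias at the new version for evaluation.
MlflowClient().set_registered_model_alias(
    "prod.fraud_detection.fraud_clf", "Challenger", mv.version
)
```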

 

Model Deployment and Monitoring

Once the best-performing model has been identified, it should be deployed to production and continuously monitored to detect any model drift. This proactive monitoring enables timely retraining when model drift occurs. Let’s explore how Unity Catalog can facilitate both model deployment and ongoing model monitoring.

Model Deployment

With Unity Catalog enabled, you can use Databricks Model Serving (link) to deploy classical ML models, generative AI models, and AI agents via the API, SDK, or the Databricks UI.

Model aliases (link) make deployment easier: you can assign a name to a specific version of a model, which helps you track which version is currently in production. For example, you can label the current production model as "Champion." When you want to update the production model, you simply assign the "Champion" alias to the new version, and any workloads referencing the "Champion" model automatically switch to it.
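
As a sketch, promoting a validated version and loading by alias might look like this; the model name and version number are illustrative:

```python
import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")

# Repoint "Champion" at the newly validated version; any workload that
# references the alias picks up the new version without code changes.
MlflowClient().set_registered_model_alias(
    "prod.fraud_detection.fraud_clf", "Champion", 3
)

# Load whatever version "Champion" currently points to.
champion = mlflow.pyfunc.load_model("models:/prod.fraud_detection.fraud_clf@Champion")
```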

Model Monitoring 

Once your ML model is deployed, you need to set up a continuous monitoring mechanism to ensure the model's quality continues to meet business requirements over time. Inference tables (link) in Unity Catalog provide an easy way to monitor and debug your models: they automatically capture incoming requests and outgoing responses for a model serving endpoint and log them to a Unity Catalog Delta table. You can use the data in this table to monitor, debug, and improve ML models.


Figure 3: Typical workflow for inference table
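
As a sketch, inspecting recent traffic could look like this; the inference table name is whatever you configured on the endpoint, so fraud_clf_payload is hypothetical:

```python
payload = spark.table("prod.fraud_detection.fraud_clf_payload")

# Most recent requests/responses captured by the serving endpoint.
display(
    payload.select("timestamp_ms", "request", "response", "status_code")
           .orderBy("timestamp_ms", ascending=False)
           .limit(100)
)
```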

 

Other Considerations

Lineage, discoverability, collaboration, and auditing apply throughout the machine learning lifecycle, so it’s important to know how to achieve them with Unity Catalog.

Lineage

When you train your model in Unity Catalog, you can trace its lineage back to the upstream datasets it was trained and evaluated on. Unity Catalog maintains a record of model lineage, including the origin and modifications of models, facilitating transparency and reproducibility.

You can track model lineage via the UI (link) or programmatically (link).


Figure 4: Sample Model and Data Lineage
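
For the programmatic route, here is a hedged sketch against the lineage-tracking REST API; the workspace URL, token, and table name are placeholders:

```python
import requests

resp = requests.get(
    "https://<workspace-url>/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": "Bearer <token>"},
    json={"table_name": "prod.gold.user_features", "include_entity_lineage": True},
)
# Upstream/downstream tables plus notebooks, jobs, and other entities.
print(resp.json())
```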

Discoverability

Unity Catalog allows you to discover ML models both in Catalog Explorer and via the MLflow API (link). With Catalog Explorer, you can view model schema details, preview sample data, and see the model type, location, history, frequent queries and users, and other details. You can simply search by keyword in Catalog Explorer or in Navigational Search (link), which returns a list of models matching the keyword, as shown below.

Catalog Explorer


 Navigational Search Bar (located in the top bar of the Databricks workspace)


Figure 5: Discoverability with UC

You can use tags (link) in Unity Catalog to organize and categorize objects, making search and discovery easier. Tags can be replicated globally, so avoid using sensitive information in tag names or values to maintain security.
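
A small sketch of tagging with SQL; the tag keys and values are illustrative:

```python
spark.sql("ALTER TABLE prod.gold.user_features "
          "SET TAGS ('use_case' = 'fraud', 'layer' = 'gold')")
spark.sql("ALTER SCHEMA prod.fraud_detection SET TAGS ('team' = 'fraud-ml')")
```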

Collaboration

Unity Catalog centralizes data and AI assets, enabling seamless collaboration across teams by allowing models to be accessed from any workspace connected to the metastore (link), provided users have the necessary privileges. This setup makes it easy to compare new models with production baselines across different environments, such as accessing production models from a development workspace. To facilitate collaboration, model ownership can be transferred to a group that includes all collaborators.


Figure 6: Organizing datasets, feature tables, functions, volumes, and ML models

The diagram above shows how Unity Catalog can be used to organize datasets, feature tables, functions, volumes, and ML models within a project-specific schema (in this case, churn).

Auditing

Unity Catalog automatically captures detailed audit logs of all actions performed during machine learning development within the Databricks environment. These logs include information on access and modifications to data and AI assets, such as feature tables (link), experiments (link), and models (link). They provide fine-grained visibility into who accessed specific datasets and what actions were taken, ensuring transparency and accountability.
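
As a sketch, assuming system tables are enabled for your account, recent model-related activity can be pulled from the audit system table:

```python
display(spark.sql("""
  SELECT event_time, user_identity.email, action_name, request_params
  FROM system.access.audit
  WHERE service_name = 'unityCatalog'
    AND action_name ILIKE '%model%'
    AND event_date >= current_date() - INTERVAL 7 DAYS
  ORDER BY event_time DESC
"""))
```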

 

Conclusion

Unity Catalog isn’t just a governance tool; it’s a strategic enabler across your ML development lifecycle. With thoughtful design and consistent organization, you can help your organization meet stringent industry compliance and auditing requirements while also fostering stronger team collaboration, streamlined lineage tracking, and easier asset discovery.

 

Call To Action

If you're just getting started, map out your catalogs and schemas using the medallion architecture (link) and identify use case-specific schemas.

Already using Unity Catalog? Take the next step by aligning your access policies, tagging standards, and ML model registration with the design principles shared above.

Want to learn more or need help designing your Unity Catalog structure? Reach out to your Databricks account team or explore the Databricks Unity Catalog documentation.