
Databricks Asset Bundles and MLOps structure for different model training: 1 model per DAB, or 1 DAB?

mlopsuser
New Contributor

I have two different datasets that will be used to train two separate regression models. Each dataset has its own preprocessing steps, and the models will have independent training pipelines.

What is the best-practice approach for organizing Databricks Asset Bundles (DABs) in this scenario? Specifically, I'm wondering whether it's better to create one DAB per model and dataset, or to combine everything into a single DAB for simplicity.

Additionally, any insights on structuring the MLOps pipeline for model registry, deployment, and monitoring in such a setup would be greatly appreciated.

The DABs will live in a monorepo that will also host future use cases.

1 REPLY

NandiniN
Databricks Employee

Hi @mlopsuser,

For organizing Databricks Asset Bundles (DABs) in your scenario with two separate regression models and datasets, it is generally recommended to create one DAB per model and dataset. This approach aligns with best practices for modularity and maintainability, allowing each model and its associated preprocessing steps to be managed independently. Here are some detailed steps and considerations:

  1. Create Separate DABs:

    • Modularity: By creating separate DABs for each model and dataset, you ensure that changes in one model or dataset do not inadvertently affect the other. This modular approach simplifies debugging and enhances the clarity of your project structure.
    • Scalability: Independent DABs make it easier to scale and manage each model's lifecycle, including training, evaluation, and deployment.
  2. Structuring the MLOps Pipeline:

    • Model Registry: Use MLflow to register each model independently. This allows you to track versions, manage metadata, and monitor performance metrics for each model separately (see the registration sketch after this list).
    • Deployment: Deploy each model using its respective DAB. This ensures that the deployment process is isolated and can be tailored to the specific requirements of each model.
    • Monitoring: Set up monitoring for each model independently. This includes tracking performance metrics, data drift, and other relevant indicators to ensure each model remains performant over time (a toy drift check follows the list).
  3. Monorepo Considerations:

    • Directory Structure: Organize your monorepo with a clear directory structure for each DAB. For example:

      /monorepo
      ├── model1
      │   ├── databricks.yml
      │   ├── src/
      │   └── tests/
      └── model2
          ├── databricks.yml
          ├── src/
          └── tests/
    • CI/CD Integration: Implement CI/CD pipelines that can handle multiple DABs. Ensure that each pipeline can independently validate, test, and deploy its respective DAB (see the deployment sketch after this list).
  4. Best Practices:

    • Version Control: Use version control to manage changes to each DAB. This includes tracking changes to preprocessing steps, model training code, and deployment configurations.
    • Documentation: Maintain comprehensive documentation for each DAB, detailing the preprocessing steps, model architecture, and deployment process. This aids in collaboration and future maintenance.
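
To make the registry point concrete, here is a minimal sketch of registering one of the two models independently with MLflow. The model name "regression_model_1", the scikit-learn flavor, and the placeholder data are assumptions for illustration, not details from your setup:

import mlflow
import numpy as np
from mlflow.models import infer_signature
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)          # placeholder features
y = X @ np.array([1.0, 2.0, 3.0])   # placeholder target

with mlflow.start_run(run_name="model1_training"):
    model = LinearRegression().fit(X, y)
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=infer_signature(X, model.predict(X)),
        # Each model gets its own registry entry, so versions and
        # metrics never mix across the two pipelines.
        registered_model_name="regression_model_1",  # hypothetical name
    )

The second model would use the same pattern with its own run and its own registered name, keeping the two lifecycles fully separate.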
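On the monitoring side, Databricks Lakehouse Monitoring can handle this natively. Purely as an illustration of running the same check independently per model, here is a toy drift heuristic that compares feature means in new data against each model's training baseline; the threshold and the mean-shift metric are assumptions, not a recommended production approach:

import numpy as np

def mean_shift_drift(baseline: np.ndarray, current: np.ndarray,
                     threshold: float = 0.1) -> bool:
    """Flag drift when any feature mean moves more than `threshold` (relative)."""
    base = baseline.mean(axis=0)
    curr = current.mean(axis=0)
    rel_shift = np.abs(curr - base) / (np.abs(base) + 1e-9)
    return bool((rel_shift > threshold).any())

# Run the same check separately for each model's feature set, e.g.:
# drifted_1 = mean_shift_drift(train_features_1, new_features_1)  # hypothetical arrays
# drifted_2 = mean_shift_drift(train_features_2, new_features_2)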
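For the CI/CD piece, one simple pattern is a driver script that walks the bundle directories and validates and deploys each one in isolation. This sketch assumes the Databricks CLI is installed and authenticated on the CI runner, and that the directory and target names match the layout above; adapt them to your workspace setup:

import subprocess
from pathlib import Path

BUNDLES = ["model1", "model2"]   # one directory per DAB, as in the layout above
TARGET = "dev"                   # bundle target; assumption for illustration

for bundle in BUNDLES:
    # Validate this bundle's configuration on its own.
    subprocess.run(["databricks", "bundle", "validate", "-t", TARGET],
                   cwd=Path(bundle), check=True)
    # Deploy only this bundle; a failure here leaves the other DAB untouched.
    subprocess.run(["databricks", "bundle", "deploy", "-t", TARGET],
                   cwd=Path(bundle), check=True)

In practice you would typically express the same loop as a matrix job in your CI system so the two bundles build and deploy in parallel.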

Thanks!
