Databricks Asset Bundles and MLOps Structure for different model training -1 model per DAB or 1 DAB

mlopsuser — Mon, 21 Oct 2024 16:33:47 GMT

I have two different datasets that will be used to train two separate regression models. Each dataset has its own preprocessing steps, and the models will have independent training pipelines.

What is the best-practice approach for organizing Databricks Asset Bundles (DABs) in this scenario? Specifically, I’m wondering whether it’s better to create one DAB per model and dataset or to combine everything into a single DAB for simplicity.

Additionally, any insights on structuring the MLOps pipeline for model registry, deployment, and monitoring in such a setup would be greatly appreciated.

The DABs will live in a monorepo for a new use case.

Re: Databricks Asset Bundles and MLOps Structure for different model training -1 model per DAB or 1

NandiniN — Fri, 01 Nov 2024 05:22:55 GMT

Hi @mlopsuser ,

For organizing Databricks Asset Bundles (DABs) in your scenario with two separate regression models and datasets, it is generally recommended to create one DAB per model and dataset. This approach aligns with best practices for modularity and maintainability, allowing each model and its associated preprocessing steps to be managed independently. Here are some detailed steps and considerations:

Create Separate DABs:
- Modularity: By creating separate DABs for each model and dataset, you ensure that changes in one model or dataset do not inadvertently affect the other. This modular approach simplifies debugging and enhances the clarity of your project structure.
- Scalability: Independent DABs make it easier to scale and manage each model's lifecycle, including training, evaluation, and deployment.
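
As a rough illustration of the one-bundle-per-model idea, here is a minimal sketch that scaffolds a separate bundle directory for each model. The names `model1` and `model2`, the `scaffold_bundles` helper, and the single `dev` target are assumptions for illustration only; a real `databricks.yml` would also declare the jobs and other resources each model needs.

```python
from pathlib import Path

# Hypothetical bundle names; substitute your own model names.
MODELS = ["model1", "model2"]

# Minimal bundle config: each bundle gets its own name and a dev target.
BUNDLE_TEMPLATE = """\
bundle:
  name: {name}

targets:
  dev:
    mode: development
    default: true
"""


def scaffold_bundles(repo_root: Path, models: list[str]) -> list[Path]:
    """Create one bundle directory per model, each with its own databricks.yml."""
    created = []
    for name in models:
        bundle_dir = repo_root / name
        # src/ holds preprocessing and training code, tests/ holds its tests.
        (bundle_dir / "src").mkdir(parents=True, exist_ok=True)
        (bundle_dir / "tests").mkdir(parents=True, exist_ok=True)
        config = bundle_dir / "databricks.yml"
        if not config.exists():  # never overwrite an existing bundle config
            config.write_text(BUNDLE_TEMPLATE.format(name=name))
        created.append(config)
    return created
```

Calling `scaffold_bundles(Path("monorepo"), MODELS)` would give each model its own bundle root, so each can be validated and deployed without touching the other.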
Structuring the MLOps Pipeline:
- Model Registry: Use MLflow to register each model independently. This allows you to track versions, manage metadata, and monitor performance metrics for each model separately.
- Deployment: Deploy each model using its respective DAB. This ensures that the deployment process is isolated and can be tailored to the specific requirements of each model.
- Monitoring: Set up monitoring for each model independently. This includes tracking performance metrics, data drift, and other relevant indicators to ensure each model remains performant over time.
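
To make the independent registration point concrete, here is a small sketch that gives each model its own registered-model name. It assumes a Unity Catalog registry, a hypothetical catalog `ml_dev` and schema `regression`, and that each training run logged its model under the artifact path `model`; none of these come from your setup, so adjust them as needed.

```python
def registered_model_name(catalog: str, schema: str, model: str) -> str:
    """Build a three-level Unity Catalog model name: catalog.schema.model."""
    return f"{catalog}.{schema}.{model}"


def register_run_model(run_id: str, full_name: str):
    """Register the model logged under a run as a new version of full_name."""
    import mlflow  # imported lazily so the naming helper works without MLflow

    # Assumption: models are registered in Unity Catalog.
    mlflow.set_registry_uri("databricks-uc")
    # Assumption: the training run logged its model under the path "model".
    return mlflow.register_model(f"runs:/{run_id}/model", full_name)


# Hypothetical catalog and schema; one registered model per bundle.
NAMES = {
    m: registered_model_name("ml_dev", "regression", m)
    for m in ("model1", "model2")
}
```

Because each bundle registers under a different name, versions and metadata for the two models stay separate in the registry.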
Monorepo Considerations:
- Directory Structure: Organize your monorepo with a clear directory for each DAB. For example:

    /monorepo
    ├── model1
    │   ├── databricks.yml
    │   ├── src/
    │   └── tests/
    └── model2
        ├── databricks.yml
        ├── src/
        └── tests/

- CI/CD Integration: Implement CI/CD pipelines that can handle multiple DABs. Ensure that each pipeline is capable of independently validating, testing, and deploying the respective DAB.
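
One way a CI job might handle multiple DABs is to work out which bundles a change touches and run the Databricks CLI only for those. The sketch below assumes the bundle directories are `model1` and `model2` at the repo root; `affected_bundles` and `bundle_commands` are hypothetical helper names, not part of any Databricks tooling.

```python
from pathlib import PurePosixPath

# Assumption: top-level directories that each contain a databricks.yml.
BUNDLE_DIRS = {"model1", "model2"}


def affected_bundles(changed_files: list[str]) -> list[str]:
    """Return the bundles whose directory contains at least one changed file."""
    hit = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if parts and parts[0] in BUNDLE_DIRS:
            hit.add(parts[0])
    return sorted(hit)


def bundle_commands(target: str) -> list[list[str]]:
    """Databricks CLI calls for one bundle, meant to run from its directory."""
    return [
        ["databricks", "bundle", "validate", "-t", target],
        ["databricks", "bundle", "deploy", "-t", target],
    ]
```

In a pipeline, the changed-file list could come from something like `git diff --name-only`, and each command could be run with `subprocess.run(cmd, cwd=bundle, check=True)` so that a failure in one bundle does not deploy the other.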
Best Practices:
- Version Control: Use version control to manage changes to each DAB. This includes tracking changes to preprocessing steps, model training code, and deployment configurations.
- Documentation: Maintain comprehensive documentation for each DAB, detailing the preprocessing steps, model architecture, and deployment process. This aids in collaboration and future maintenance.

Thanks!
