Databricks Community

mlopsuser · a month ago

I have two different datasets that will be used to train two separate regression models. Each dataset has its own preprocessing steps, and the models will have independent training pipelines.

What is the best-practice approach for organizing Databricks Asset Bundles (DABs) in this scenario? Specifically, I’m wondering whether it’s better to create one DAB per model and dataset or to combine everything into a single DAB for simplicity.

Additionally, any insights on structuring the MLOps pipeline for model registry, deployment, and monitoring in such a setup would be greatly appreciated.

The DABs will live in a monorepo for a new use case.

NandiniN · 3 weeks ago

Hi @mlopsuser,

For organizing Databricks Asset Bundles (DABs) in your scenario with two separate regression models and datasets, it is generally recommended to create one DAB per model and dataset. This approach aligns with best practices for modularity and maintainability, allowing each model and its associated preprocessing steps to be managed independently. Here are some detailed steps and considerations:

Create Separate DABs:
- Modularity: By creating separate DABs for each model and dataset, you ensure that changes in one model or dataset do not inadvertently affect the other. This modular approach simplifies debugging and enhances the clarity of your project structure.
- Scalability: Independent DABs make it easier to scale and manage each model's lifecycle, including training, evaluation, and deployment.
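
To make the one-bundle-per-model idea concrete, here is a minimal sketch that writes a separate `databricks.yml` for each model. The model names, the `dev`/`prod` target names, and the helper functions are hypothetical; a real bundle would also need workspace settings and resource definitions (jobs, pipelines, and so on), which are left out here.

```python
from pathlib import Path
import tempfile

# Hypothetical model names; replace with your own.
MODELS = ["model1", "model2"]

# Minimal bundle config: a bundle name plus two deployment targets.
# Assumption: real configs also declare workspace settings and resources.
BUNDLE_TEMPLATE = """\
bundle:
  name: {name}

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
"""


def render_bundle_config(name: str) -> str:
    """Return a minimal databricks.yml body for one model's bundle."""
    return BUNDLE_TEMPLATE.format(name=name)


def write_bundle_configs(repo_root: Path, models) -> list:
    """Write <repo_root>/<model>/databricks.yml for every model and return the paths."""
    created = []
    for name in models:
        bundle_dir = repo_root / name
        bundle_dir.mkdir(parents=True, exist_ok=True)
        config_path = bundle_dir / "databricks.yml"
        config_path.write_text(render_bundle_config(name))
        created.append(config_path)
    return created


if __name__ == "__main__":
    # Write into a throwaway directory so the sketch has no side effects.
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_bundle_configs(Path(tmp), MODELS):
            print(path.relative_to(tmp))
```

Because each bundle carries its own name and targets, one model can be validated and deployed without touching the other.
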
Structuring the MLOps Pipeline:
- Model Registry: Use MLflow to register each model independently. This allows you to track versions, manage metadata, and monitor performance metrics for each model separately.
- Deployment: Deploy each model using its respective DAB. This ensures that the deployment process is isolated and can be tailored to the specific requirements of each model.
- Monitoring: Set up monitoring for each model independently. This includes tracking performance metrics, data drift, and other relevant indicators to ensure each model remains performant over time.
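
As a rough illustration of registering each model independently, the sketch below gives every model its own registered-model name and registers a run's logged model under it. It assumes models are registered in Unity Catalog; the catalog `ml`, the schema `regression`, the artifact path `model`, and both helper functions are placeholders, not part of any official setup.

```python
# Assumed Unity Catalog location; "ml" and "regression" are hypothetical names.
CATALOG = "ml"
SCHEMA = "regression"


def registered_model_name(model_key: str, catalog: str = CATALOG, schema: str = SCHEMA) -> str:
    """Build a three-level name so each model gets its own registry entry."""
    return f"{catalog}.{schema}.{model_key}"


def register_run_model(run_id: str, model_key: str):
    """Register the model logged by a training run as a new version of its own registered model.

    Assumes the training run logged its model under the artifact path "model".
    """
    import mlflow  # imported lazily so the naming helper works without MLflow installed

    mlflow.set_registry_uri("databricks-uc")  # point the client at the Unity Catalog registry
    return mlflow.register_model(
        model_uri=f"runs:/{run_id}/model",
        name=registered_model_name(model_key),
    )
```

Each training pipeline would call `register_run_model` with its own key (for example `"model1"`), so version history and metadata never mix between the two models.
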
Monorepo Considerations:
- Directory Structure: Organize your monorepo with clear directory structures for each DAB. For example:
  /monorepo
  ├── model1
  │   ├── databricks.yml
  │   ├── src/
  │   └── tests/
  └── model2
      ├── databricks.yml
      ├── src/
      └── tests/
- CI/CD Integration: Implement CI/CD pipelines that can handle multiple DABs. Ensure that each pipeline is capable of independently validating, testing, and deploying the respective DAB.
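
One possible way to keep the pipelines independent in a monorepo is to work out which bundles a change touches and only validate, test, and deploy those. The sketch below assumes the `model1`/`model2` directories from the layout above, a list of changed paths (for example from `git diff --name-only`), and pytest as the test runner; the function names are made up for illustration.

```python
from pathlib import PurePosixPath

# Bundle directories at the monorepo root (hypothetical names from the example layout).
BUNDLES = ("model1", "model2")


def affected_bundles(changed_files, bundles=BUNDLES):
    """Return the bundles whose directory contains at least one changed file."""
    hit = set()
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if parts and parts[0] in bundles:
            hit.add(parts[0])
    return sorted(hit)


def ci_plan(changed_files, target="dev"):
    """List (working_directory, command) steps for every affected bundle."""
    plan = []
    for bundle in affected_bundles(changed_files):
        plan.append((bundle, ["databricks", "bundle", "validate", "-t", target]))
        # Assumption: unit tests live in <bundle>/tests and are run with pytest.
        plan.append((bundle, ["python", "-m", "pytest", "tests"]))
        plan.append((bundle, ["databricks", "bundle", "deploy", "-t", target]))
    return plan


if __name__ == "__main__":
    changed = ["model1/src/train.py", "README.md"]
    for cwd, cmd in ci_plan(changed):
        print(cwd, " ".join(cmd))
```

A CI job could run each step with `subprocess.run(cmd, cwd=cwd, check=True)`, so a change under `model1/` never triggers a deployment of `model2`.
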
Best Practices:
- Version Control: Use version control to manage changes to each DAB. This includes tracking changes to preprocessing steps, model training code, and deployment configurations.
- Documentation: Maintain comprehensive documentation for each DAB, detailing the preprocessing steps, model architecture, and deployment process. This aids in collaboration and future maintenance.

Thanks!


Databricks Asset Bundles and MLOps Structure for different model training -1 model per DAB or 1 DAB
