2021-09 webinar: Automating the ML Lifecycle With Databricks Machine Learning (post 1 of 2)
Thank you to everyone who joined the Automating the ML Lifecycle With Databricks Machine Learning webinar! You can access the on-demand recording here and the code in this GitHub repo.
We're sharing a subset of the questions asked and answered throughout the session, as well as the links to resources from the last slide of the webinar. Please feel free to ask follow-up questions or add comments as threads. Due to length limits on Community posts, we've split this content into two posts.
Databricks ML
- How can I enable the Databricks ML workspace?
- How can I get started with Databricks ML?
AutoML
- How does your AutoML compare with other enterprise AutoML approaches?
- The key distinction is that Databricks AutoML takes a "glass-box" approach, generating a notebook for every model it fits. That lets you clone and modify the code to iterate on the models further. In general, all AutoML solutions produce reasonably good results, but not as good as models that incorporate more expert knowledge. This code-generation approach lets data scientists get a reasonable model quickly and then apply their domain expertise to improve it further. For a good introduction, I'd recommend the Data + AI Summit 2021 keynote on Databricks ML: https://youtu.be/zQEiwJqqeeA
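To make that concrete, here's a minimal sketch of kicking off an AutoML classification run from a notebook. It assumes a Databricks ML runtime where the `databricks.automl` package is available; the DataFrame `df` and the target column `"churn"` are placeholders.

```python
from databricks import automl

# Run an AutoML classification experiment against a training DataFrame.
summary = automl.classify(
    dataset=df,           # Spark or pandas DataFrame with features + label
    target_col="churn",   # column AutoML should learn to predict
    timeout_minutes=30,   # budget for the model search
)

# Glass-box: every trial produces a generated notebook you can clone and
# edit to layer in your own domain expertise.
print(summary.best_trial.notebook_url)
```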
General MLflow
- What support do MLflow and Databricks have for R?
- What is MLflow autologging vs. Databricks autologging?
- What ML frameworks are supported by MLflow?
- How can I track which dataset was used to train each model in MLflow and Databricks?
- If you're using Databricks AutoML, it automatically logs the dataset to the MLflow Tracking Server.
- If you're writing custom ML code, your best options (sketched in the example after this list) are:
- For Spark data sources, especially Delta: if you enable autologging and read from a Spark datasource, MLflow logs the source path as a tag on the MLflow run. If it's a Delta source, it also records the table version number.
- For non-Spark data sources (e.g., loading via pandas), you can always log a custom tag or param to save the dataset location, ID, or version number.
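Here's a minimal sketch of both options. The dataset path and version values are hypothetical placeholders, and it assumes an environment with MLflow available (as in the Databricks ML runtime).

```python
import mlflow

# Spark data sources: with Spark autologging enabled, MLflow records the
# datasource path (and the table version, for Delta) as tags on the run.
mlflow.spark.autolog()

# Non-Spark data sources (e.g., pandas): log the location and version
# yourself as a tag or param on the run.
with mlflow.start_run():
    mlflow.set_tag("dataset_path", "s3://my-bucket/training.parquet")  # hypothetical path
    mlflow.log_param("dataset_version", "v42")                         # hypothetical version
    # ... train and log your model here ...
```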
Model Registry
- Is the MLflow Model Registry restricted to a single workspace, or can multiple workspaces push to a centralized or common registry?