docs.databricks.com

Joseph_B — Fri, 08 Oct 2021 16:05:02 GMT

2021-09 webinar: Automating the ML Lifecycle With Databricks Machine Learning (post 1 of 2)

Thank you to everyone who joined the Automating the ML Lifecycle With Databricks Machine Learning webinar! You can access the on-demand recording here and the code in this GitHub repo.

We're sharing a subset of the questions asked and answered throughout the session, as well as the links to resources from the last slide of the webinar. Please feel free to ask follow-up questions or add comments as threads. Due to length limits on Community posts, we've split this into two posts.

Databricks ML

How can I enable the Databricks ML workspace?
- In your Databricks workspace, you should see a persona selector in the upper left. The docs include a GIF showing how to select it: https://docs.databricks.com/applications/machine-learning/index.html You can also pin a particular “persona” or workspace view to be your default.
How can I get started with Databricks ML?
- If you want guided tutorials, then the Databricks Academy has great resources, especially its recommended learning path for Data Scientists: https://academy.databricks.com/data-scientist These resources are free for customers; contact support if you have trouble accessing them.
- If there is a particular task you want to do, start from our documentation https://docs.databricks.com/applications/machine-learning/index.html to find the right page, and look for examples and code for that task.

AutoML

How does your AutoML compare with other enterprise AutoML approaches?
- The biggest difference is that Databricks AutoML takes a "glass-box" approach, generating a notebook for every model it fits. That allows you to clone and modify the code to iterate further on the models. In general, AutoML solutions produce pretty good results, but not as good as models that incorporate more expert knowledge. This code-generation approach lets data scientists get a reasonable model quickly and then apply their domain expertise to improve it. For a good intro, I'd recommend the Data + AI Summit 2021 keynote on Databricks ML: https://youtu.be/zQEiwJqqeeA

General MLflow

What support do MLflow and Databricks have for R?
- MLflow has native support for R. You can find the R API docs here: https://mlflow.org/docs/latest/R-api.html
- If you're working within Databricks, then Databricks Runtimes provide R and many common packages out-of-the-box. We generally recommend using the Databricks Runtime for Machine Learning since it provides more ML-specific packages, and it makes it easy to run RStudio in Databricks. To find which version of R each runtime uses, you can check the runtime release notes.
What is MLflow autologging vs. Databricks autologging?
- MLflow provides autologging which automatically tracks ML training activity for certain libraries. E.g., mlflow.sklearn.autolog() triggers tracking for scikit-learn, picking up parameters, metrics, and models when you train models. You can also call mlflow.autolog() to turn on all types of MLflow autologging. "Databricks Autologging" turns on MLflow autologging by default, and it just entered Public Preview in many regions: https://docs.databricks.com/applications/mlflow/databricks-autologging.html (That's for AWS, but there's an equivalent page for Azure and GCP.) You can find more info in those docs.
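Here is a minimal sketch of the MLflow autologging call described above. It assumes mlflow and scikit-learn are installed. The iris dataset, the LogisticRegression model, and the function name train_with_autologging are illustrative choices, not from the webinar.

```python
def train_with_autologging():
    """Train a small scikit-learn model with MLflow autologging enabled.

    Returns the ID of the MLflow run that received the logged data.
    """
    # Imported inside the function so the sketch only needs
    # mlflow and scikit-learn when it is actually called.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Enable autologging for scikit-learn only. Calling mlflow.autolog()
    # instead would enable it for all supported libraries.
    mlflow.sklearn.autolog()

    X, y = load_iris(return_X_y=True)
    with mlflow.start_run() as run:
        # Calling fit() is what triggers logging of parameters,
        # training metrics, and the fitted model to the active run.
        LogisticRegression(max_iter=200).fit(X, y)
    return run.info.run_id
```

With Databricks Autologging turned on, the explicit autolog() call should not be needed on supported clusters; see the linked docs for the details.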
What ML frameworks are supported by MLflow?
- MLflow has built-in support for many common frameworks, but it is also pluggable and can be used with any ML framework.
- For autologging, the MLflow docs provide a list of built-in integrations, as well as info on custom logging.
- For saving models (MLflow Models and “flavors”), the MLflow docs provide a list of built-in integrations, as well as info on customization via pyfunc models and custom flavors.
How can I track which dataset was used to train each model in MLflow and Databricks?
- If you're using Databricks AutoML, it automatically logs the dataset to the MLflow Tracking Server.
- If you’re writing custom ML code, then your best options are:
  - For Spark data sources, especially Delta: If you use autologging and read from a Spark data source, MLflow logs that data source as a tag on the MLflow run. If it's a Delta data source, it also saves the table version number.
  - For non-Spark data sources (e.g., loading via pandas), you can always log a custom tag or param to save the dataset location, ID or version number.
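For the non-Spark case, logging a custom tag could look roughly like the sketch below. The tag names (dataset.path, dataset.sha256, dataset.version) and both helper functions are hypothetical. MLflow does not prescribe them, and the content hash is just one way to get a stable dataset ID when no version number exists.

```python
import hashlib
from pathlib import Path


def dataset_tags(path, version=None):
    """Build a dict of tags describing a dataset file.

    The tag names are hypothetical; pick whatever convention suits your team.
    """
    content = Path(path).read_bytes()
    tags = {
        "dataset.path": str(path),
        # A content hash acts as a dataset ID that changes when the data changes.
        "dataset.sha256": hashlib.sha256(content).hexdigest(),
    }
    if version is not None:
        tags["dataset.version"] = str(version)
    return tags


def log_dataset(path, version=None):
    """Attach the dataset tags to the active MLflow run."""
    # Imported here so dataset_tags() can be used without mlflow installed.
    import mlflow

    # set_tags() writes to the active run (MLflow starts one if none is active).
    mlflow.set_tags(dataset_tags(path, version))
```

You would call log_dataset("/path/to/train.csv", version=3) inside your training run, right after loading the file with pandas.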

Model Registry

Is the MLflow registry restricted to a workspace? Or can multiple workspaces push to a centralized or common registry?
- You can set up a multi-workspace registry. That's common for splitting into dev/test/prod workspaces, all of which share one registry. Here's some more info on that: https://docs.databricks.com/applications/machine-learning/manage-model-lifecycle/multiple-workspaces.html
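As a rough sketch of how a dev workspace might point at a shared registry: as I understand the linked docs page, the remote registry is addressed with a URI of the form databricks://<scope>:<prefix>, where the secret scope holds the remote workspace's credentials. The scope and prefix values and both helper names below are hypothetical. Check the docs for the exact secret setup.

```python
def remote_registry_uri(scope, prefix):
    """Return a registry URI of the form databricks://<scope>:<prefix>."""
    if not scope or not prefix:
        raise ValueError("scope and prefix must both be non-empty")
    return f"databricks://{scope}:{prefix}"


def use_remote_registry(scope, prefix):
    """Point MLflow's Model Registry client at the shared workspace."""
    # Imported here so the URI helper works without mlflow installed.
    import mlflow

    uri = remote_registry_uri(scope, prefix)
    # After this call, model registration and stage transitions made
    # through MLflow target the shared registry, not the local one.
    mlflow.set_registry_uri(uri)
    return uri
```

For example, use_remote_registry("registry-scope", "central") would direct later mlflow.register_model(...) calls to the central registry.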
