Joseph_B

2021-09 webinar: Automating the ML Lifecycle With Databricks Machine Learning (Post 2 of 2)

Thank you to everyone who joined! You can access the on-demand recording here and the code in this GitHub repo.

We're sharing a subset of the questions asked and answered throughout the session, as well as the links to resources in the last slide of the webinar. Due to length limits on Community posts, we've split this into two posts; this is the second.

Scoring and Serving

  • How can I deploy MLflow models in batch or streaming settings?
    • The easiest way is generally to load the model as a “pyfunc” model (or the equivalent in R/Scala/Java) and apply it as an Apache Spark UDF (user-defined function) to a Spark DataFrame, which can be batch or streaming. In Databricks, when you open the MLflow page for a run and click on the model artifact, you can see generated code for doing exactly that: loading the model and applying it to a DataFrame. A minimal sketch is included after this list.
  • How does MLflow connect with 3rd-party serving tools for REST API serving?
    • MLflow can deploy models to 3rd-party tools for REST API serving, using either built-in integrations or custom containers. For AWS SageMaker and Azure ML, MLflow provides a one-line API call for deployment; find more info in the docs. A deployment sketch also follows this list.
    • For other services, you can either have those services use the MLflow client to load models (e.g., in a container you build from scratch, using the load API), or use MLflow’s tools to generate containers that you can then customize further (see the build-docker command).
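
To make the batch/streaming answer concrete, here is a minimal sketch in Python. The model URI and table name are hypothetical placeholders, and it assumes a Databricks notebook where `spark` is predefined (`spark.readStream.table` needs a recent runtime, Spark 3.1+).

```python
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Hypothetical model URI; substitute your own run ID or registry path.
model_uri = "runs:/<run_id>/model"
predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)

# Batch scoring: apply the model as a UDF over the feature columns.
batch_df = spark.read.table("feature_table")  # hypothetical table name
scored = batch_df.withColumn("prediction", predict(struct(*batch_df.columns)))

# Streaming scoring: the same UDF works on a streaming DataFrame.
stream_df = spark.readStream.table("feature_table")
scored_stream = stream_df.withColumn("prediction", predict(struct(*stream_df.columns)))
```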
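
And for the SageMaker one-liner mentioned above, the MLflow 1.x API looks roughly like this. The endpoint name, model URI, and region are hypothetical; it assumes AWS credentials are configured and the MLflow serving container has already been pushed to ECR.

```python
import mlflow.sagemaker

# Hypothetical names; assumes the MLflow pyfunc serving image is in ECR
# (e.g., built via `mlflow sagemaker build-and-push-container`).
mlflow.sagemaker.deploy(
    app_name="churn-model-endpoint",
    model_uri="models:/churn_model/Production",
    region_name="us-east-1",
    mode="create",
)
```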

Jobs and CI/CD

  • How can I do CI/CD on ML pipelines within Databricks?
    • This of course is a big topic, but I’ll point to some important features and useful resources.
    • For code management, the primary modes of operation are:
      • Either use Databricks Repos to manage your code with git, which lets you edit and run code both within Databricks and within external tools like IDEs and CI/CD systems.
      • Or develop outside of Databricks in an IDE and deploy your code on Databricks as packages/libraries.
    • For CI/CD tools, it’s common to use dedicated tools like Jenkins, especially if the organization is already using such a tool.
    • The doc section on developer tools and guidance is helpful.
    • The Databricks Labs CICD-templates tool is also powerful for programmatic management.
    • This recent blog post provides a great example of CI/CD with Repos; a minimal sketch of the kind of Repos API call such a pipeline makes follows this list.
  • What options do I have for job scheduling and orchestration in Databricks?
    • Databricks Jobs provide a way to encapsulate a task in a job, and the newer multi-task job support lets you chain multiple tasks together. Jobs support schedules, triggers, permissions, auditing, notifications, APIs, and more. A Jobs API sketch follows this list.
    • Using Jobs, you have two main options for scheduling and orchestration:
      • Only use Databricks Jobs: This is reasonable if you are only orchestrating Databricks Jobs, not tasks outside of Databricks.
      • Use an external orchestrator like Apache Airflow: This can be better if you have more diverse tasks to orchestrate and/or more complex dependencies between tasks.
  • Can Jobs use single machines, rather than clusters? And can they use GPUs?
    • Yes, for both interactive work and automated Jobs, you can configure clusters in Single Node mode. They can still run (small) Spark jobs and can reduce costs.
    • Similarly, you can create both interactive clusters and automated job clusters with GPUs; the sketch after this list includes a single-node GPU job cluster.
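
As promised above, here is a hedged sketch of the Repos API call a CI/CD pipeline might make to fast-forward a Databricks Repo to a freshly tested branch. The workspace URL, repo ID, and branch are hypothetical placeholders; the token would normally come from your CI system's secrets.

```python
import os
import requests

# Hypothetical workspace URL and repo ID; the token comes from CI secrets.
host = "https://<workspace>.cloud.databricks.com"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Update the Repo checkout to the branch the pipeline just validated.
resp = requests.patch(
    f"{host}/api/2.0/repos/<repo_id>",
    headers=headers,
    json={"branch": "main"},
)
resp.raise_for_status()
```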
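
And here is a sketch combining the multi-task Jobs and single-node GPU answers, using the Jobs 2.1 API. All names, notebook paths, and instance types are hypothetical; `num_workers: 0` plus the `singleNode` profile is the configuration Single Node mode applies under the hood.

```python
import os
import requests

host = "https://<workspace>.cloud.databricks.com"  # hypothetical workspace
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# A single-node GPU job cluster: zero workers plus the singleNode profile.
single_node_gpu = {
    "spark_version": "9.1.x-gpu-ml-scala2.12",
    "node_type_id": "g4dn.xlarge",  # AWS example; pick a GPU type per cloud
    "num_workers": 0,
    "spark_conf": {
        "spark.master": "local[*]",
        "spark.databricks.cluster.profile": "singleNode",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

# Two chained tasks: train, then batch-score once training succeeds.
job_spec = {
    "name": "train-then-score",
    "tasks": [
        {
            "task_key": "train",
            "notebook_task": {"notebook_path": "/Repos/me/project/train"},
            "new_cluster": single_node_gpu,
        },
        {
            "task_key": "score",
            "depends_on": [{"task_key": "train"}],
            "notebook_task": {"notebook_path": "/Repos/me/project/score"},
            "new_cluster": single_node_gpu,
        },
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print(resp.json()["job_id"])
```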

Data

  • If the user reads data from Delta tables to build models, is there a way to give access to only the Delta table and not the underlying files in AWS S3 / Azure ADLS / GCP GCS?
    • Yes, we recommend using Table Access Control Lists (table ACLs) for that. These provide fine-grained access controls that can tie into your identity management system; see the docs for details, and the short GRANT example after this list.
    • An important roadmap item to know about is the Unity Catalog; watch the Data+AI Summit 2021 keynote on that to learn more.
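
For a flavor of what table ACLs look like in practice, here is a minimal sketch. The table and group names are hypothetical, and it assumes a cluster with table access control enabled.

```python
# Grant read access on the Delta table itself to a hypothetical group.
spark.sql("GRANT SELECT ON TABLE analytics.features TO `data_scientists`")

# Without a separate `SELECT ON ANY FILE` grant, group members cannot
# bypass the table and read the underlying S3/ADLS/GCS paths directly.
```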

Resources linked in the last webinar slide

Add your follow-up questions to the threads!
