11-08-2021 10:31 AM
Thanks to everyone who joined the Best Practices for Your Data Architecture session on Getting Workloads to Production using CI/CD. You can access the on-demand session recording here, and the code in the Databricks Labs CI/CD Templates Repo.
Posted below is a subset of the questions asked and answered throughout the session. Please feel free to ask follow-up questions or add comments as threads.
Q: What are examples of scheduling Notebooks with Airflow?
Check out the blog detailing the integration between Databricks and Airflow, and read the docs with examples (AWS | Azure | GCP). Also take a look at Multitask Jobs, a Databricks-native job scheduler.
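As a rough sketch of what scheduling a notebook from Airflow can look like, assuming Airflow 2.x with the `apache-airflow-providers-databricks` package installed and a connection named `databricks_default`. The DAG name, notebook path, Spark version and node type below are placeholders for illustration, not values from the session.

```python
from datetime import datetime


def notebook_run_payload(notebook_path, spark_version, node_type_id, num_workers=2):
    """Build the JSON body for a one-time notebook run on a new cluster."""
    return {
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": num_workers,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


def build_dag():
    """Create a daily DAG with a single Databricks notebook task."""
    # Imported here so the payload helper above works without Airflow installed.
    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import (
        DatabricksSubmitRunOperator,
    )

    with DAG(
        dag_id="example_databricks_notebook",  # hypothetical name
        start_date=datetime(2021, 11, 1),
        schedule_interval="@daily",  # Airflow 2.x argument; newer releases use `schedule`
        catchup=False,
    ) as dag:
        DatabricksSubmitRunOperator(
            task_id="run_notebook",
            databricks_conn_id="databricks_default",  # assumed connection name
            json=notebook_run_payload(
                notebook_path="/Shared/example_etl",  # placeholder
                spark_version="9.1.x-scala2.12",  # placeholder
                node_type_id="i3.xlarge",  # placeholder
            ),
        )
    return dag
```

In a real DAG file you would call `dag = build_dag()` at module level so the Airflow scheduler can discover it.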
Q: Will AWS MWAA also work with notebooks?
Yes, the docs show that the Databricks connection is available in AWS MWAA (Amazon Managed Workflows for Apache Airflow).
Q: Unit Testing and Integration testing - are there frameworks for testing notebooks?
The session includes an example that uses Nutter and pytest. Here are links to the documentation for each:
1. https://github.com/microsoft/nutter [integration testing]
2. https://docs.pytest.org/en/6.2.x/ [unit testing]
There are certainly other frameworks, depending on the code you're testing and the kind of tests you're running, but we like these two because they are simple and open source.
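To give a feel for the unit-testing side, here is a minimal pytest-style sketch. It assumes the notebook logic has been factored into a plain function; `add_amount_with_tax` is a hypothetical example, and in practice the same pattern would apply to a function that takes and returns a Spark DataFrame.

```python
def add_amount_with_tax(rows, tax_rate):
    """Return copies of the rows with an `amount_with_tax` field added."""
    return [
        {**row, "amount_with_tax": round(row["amount"] * (1 + tax_rate), 2)}
        for row in rows
    ]


def test_add_amount_with_tax():
    rows = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 50.0}]
    result = add_amount_with_tax(rows, tax_rate=0.1)
    assert [r["amount_with_tax"] for r in result] == [110.0, 55.0]
    # The input rows must not be mutated.
    assert "amount_with_tax" not in rows[0]
```

Saved in a file such as `test_transforms.py`, this would be picked up by running `pytest` in a CI step.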
Q: Is it possible to integrate MLflow to deploy model artifacts within this CI/CD process?
Yes, please take a look at this blog, Using MLOps with MLflow and Azure.
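As one possible shape for such a step, the sketch below registers a model logged by a training run in the MLflow Model Registry. It assumes `mlflow` is installed, the tracking URI is already configured in the CI environment, and the model was logged under the artifact path `model`. The function names are hypothetical helpers, not part of the linked blog.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the `runs:/` URI that points at a model logged by a run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_model_from_run(run_id, registered_name, artifact_path="model"):
    """Register the run's model and return the new registry version."""
    # Imported lazily so the URI helper works without mlflow installed.
    import mlflow

    model_version = mlflow.register_model(
        model_uri_for_run(run_id, artifact_path), registered_name
    )
    return model_version.version
```

A CI/CD pipeline could call `register_model_from_run` after the tests pass and then hand the returned version to whatever promotion step you use.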
Add your follow-up questions to threads!
Labels: Airflow, AWS, Azure, Jobs & Workflows, MLflow, Pytest, Unit testing
11-12-2021 01:20 PM
Would it be possible to get the PowerPoint that was used for this session? It has several embedded links that would be useful but can't be accessed from the video. Thanks!
11-18-2021 01:00 PM
Here's the list of embedded links!
Jobs scheduling and orchestration
- Built-in job scheduling: https://docs.databricks.com/jobs.html#schedule-a-job
  - Periodic scheduling of the jobs
  - Execute notebook / jar / Python script / Spark-submit
- Multitask Jobs
  - Execute notebook / jar / Python script / Spark-submit
- Contrib module in Airflow
  - Execute notebook / jar / Python script
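For the built-in scheduler and Multitask Jobs, a job definition can also be created programmatically. The sketch below assumes the Jobs API 2.1 `jobs/create` endpoint and a personal access token. The helper names, cluster ID and notebook paths are hypothetical and only illustrate the structure of a scheduled job whose tasks run one after another.

```python
import json
import urllib.request


def multitask_job_payload(name, cluster_id, notebook_paths, cron, timezone="UTC"):
    """Build a Jobs API 2.1 body where each notebook runs after the previous one."""
    tasks = []
    for index, path in enumerate(notebook_paths):
        task = {
            "task_key": f"task_{index}",
            "existing_cluster_id": cluster_id,
            "notebook_task": {"notebook_path": path},
        }
        if index > 0:
            # Chain the tasks so they execute sequentially.
            task["depends_on"] = [{"task_key": f"task_{index - 1}"}]
        tasks.append(task)
    return {
        "name": name,
        "tasks": tasks,
        "schedule": {"quartz_cron_expression": cron, "timezone_id": timezone},
    }


def create_job(host, token, payload):
    """POST the payload to the workspace and return the parsed JSON response."""
    request = urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

For example, `multitask_job_payload("nightly", "<cluster-id>", ["/Shared/ingest", "/Shared/transform"], "0 0 6 * * ?")` describes a two-task job that runs daily at 06:00 UTC.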
Development interface resources
- Notebooks: https://docs.databricks.com/notebooks/index.html#notebooks
- More flexibility and control than the self-service git integration (Workspace CLI)
- Databricks REST API
- Databricks CLI - interface to the REST API
- Databricks Terraform Provider - create reproducible environments
- Databricks Connect - execute code on Databricks cluster(s) from a local machine
- R Studio: https://docs.databricks.com/spark/latest/sparkr/rstudio.html
Testing Code
- Notebook driven: https://databricks.com/blog/2020/01/16/automate-deployment-and-testing-with-databricks-notebook-mlfl...
- CI/CD automated testing: https://docs.databricks.com/dev-tools/ci-cd.html
- Nutter library (Microsoft): https://github.com/microsoft/nutter
- spark-testing-base (Scala & Python support)
- spark-fast-tests (Scala, Spark 2 & 3)
- chispa (Python version of spark-fast-tests)
- pytest-spark (Python, native integration with pytest)
- Code samples for all libraries in one place
Source code repository resources
- Git integration: https://docs.databricks.com/notebooks/github-version-control.html#enable-and-disable-git-versioning
- Databricks CLI:
Code promotion resources

