06-07-2021 10:50 AM
A number of people have questions about using Databricks in a production environment. What are the best practices for enabling CI/CD automation?
06-07-2021 10:51 AM
Databricks enables CI/CD via its REST API, which allows build servers (such as Jenkins, GitHub Actions, etc.) to update artifacts.
In addition, you should keep Databricks job configurations under version control as part of your CI/CD practice, and use the REST API to update and refresh them in the appropriate environment.
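For example (a minimal sketch, not an official snippet; the host, token, job id, and settings file below are placeholders your pipeline would supply), a build step can push a job definition stored in the repo using the Jobs API `reset` endpoint:

```python
# deploy_job.py - illustrative sketch of updating a job definition from CI
import json
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]    # e.g. the workspace URL for DEV/TEST/PROD
TOKEN = os.environ["DATABRICKS_TOKEN"]  # injected by the build server as a secret


def reset_job(job_id: int, settings_path: str) -> None:
    """Overwrite a job's settings with the JSON definition stored in the repo."""
    with open(settings_path) as f:
        new_settings = json.load(f)

    resp = requests.post(
        f"{HOST}/api/2.1/jobs/reset",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": job_id, "new_settings": new_settings},
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    # job id and config path would come from the pipeline's configuration
    reset_job(job_id=int(os.environ["JOB_ID"]), settings_path="jobs/nightly_etl.json")
```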
09-02-2021 11:39 PM
If you use PowerShell for your CI/CD pipelines, you may want to have a look at my DatabricksPS module, which is a wrapper for the Databricks REST API.
It has dedicated cmdlets to export (e.g. from DEV) and import again (e.g. to TEST/PROD) in an automated fashion. This includes notebooks, job definitions, cluster definitions, secrets, etc. (with SQL objects coming soon!)
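Under the hood this maps onto the Workspace REST API export/import endpoints; a rough Python equivalent for a single notebook (hosts, tokens, and paths are placeholders) would look something like:

```python
# move_notebook.py - rough sketch of exporting a notebook from DEV and importing it into PROD
import os

import requests

DEV = {"host": os.environ["DEV_HOST"], "token": os.environ["DEV_TOKEN"]}
PROD = {"host": os.environ["PROD_HOST"], "token": os.environ["PROD_TOKEN"]}


def export_notebook(ws: dict, path: str) -> str:
    """Return the base64-encoded source of a notebook from the given workspace."""
    resp = requests.get(
        f"{ws['host']}/api/2.0/workspace/export",
        headers={"Authorization": f"Bearer {ws['token']}"},
        params={"path": path, "format": "SOURCE"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["content"]


def import_notebook(ws: dict, path: str, content_b64: str) -> None:
    """Create or overwrite a notebook in the target workspace."""
    resp = requests.post(
        f"{ws['host']}/api/2.0/workspace/import",
        headers={"Authorization": f"Bearer {ws['token']}"},
        json={
            "path": path,
            "format": "SOURCE",
            "language": "PYTHON",  # assuming a Python notebook
            "content": content_b64,
            "overwrite": True,
        },
        timeout=30,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    src = export_notebook(DEV, "/Shared/etl_notebook")  # placeholder path
    import_notebook(PROD, "/Shared/etl_notebook", src)
```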
09-14-2021 02:11 AM
Some time ago I was looking for the very same answers, and this is what I found/did back then:
I'd be happy to discuss.
09-14-2021 03:06 AM
I don't know if it's best practice, but perhaps it can serve as inspiration.
We do CI/CD with unit tests of PySpark code using GitHub Actions. Have a look:
https://github.com/Energinet-DataHub/geh-aggregations#databricks-workspace
https://github.com/Energinet-DataHub/geh-aggregations/actions/workflows/aggregation-job-infra-cd.yml
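As a generic illustration (not code from those repos; the transformation and column names are made up), a PySpark unit test that runs on a plain GitHub Actions runner with a local Spark session can be as small as:

```python
# test_aggregations.py - generic example of a PySpark unit test that runs in CI
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # local[*] needs no Databricks cluster, so it works on a plain CI runner
    return SparkSession.builder.master("local[*]").appName("unit-tests").getOrCreate()


def total_per_grid_area(df):
    """Transformation under test: sum quantity per grid area."""
    return df.groupBy("grid_area").agg(F.sum("quantity").alias("total_quantity"))


def test_total_per_grid_area(spark):
    df = spark.createDataFrame(
        [("DK1", 10.0), ("DK1", 5.0), ("DK2", 7.0)],
        ["grid_area", "quantity"],
    )
    result = {r["grid_area"]: r["total_quantity"] for r in total_per_grid_area(df).collect()}
    assert result == {"DK1": 15.0, "DK2": 7.0}
```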
11-08-2021 10:38 AM
Check out the Databricks Labs CI/CD Templates. This repository provides a template for automated Databricks CI/CD pipeline creation and deployment.
11-08-2021 09:21 PM
Two additional resources come to mind:
11-11-2021 06:41 AM
We are using the Databricks Terraform provider to handle... everything, really. Then we use a CI runner (in our case Azure Pipelines) to deploy to dev/test/prod depending on branches and tags in Git (you might prefer tags or branches, whatever your branching strategy is).
It works pretty well, EXCEPT that validating mounts takes a long time (10-15 min) because it needs to spin up a cluster. That is pretty lame, and the only fix seems to be for Databricks to provide a REST API that lets you list/modify mounts, but that is nowhere on any roadmap.
11-25-2021 10:44 AM
there is no such REST API (yet)
11-28-2021 01:34 PM
Unfortunately not. Where do I vote to make it happen faster?
11-27-2021 06:57 AM
We have DBFS APIs, but I'm not sure if that solves your purpose @Erik Parmann. You can check this out: https://docs.databricks.com/dev-tools/api/latest/dbfs.html#list
11-28-2021 01:33 PM
(Note: The mount issue is not mine alone, it's a problem for everyone using the Terraform Databricks provider.)
I guess one could use the DBFS API to determine whether anything is present at the mount point, but it won't tell you whether that is because something is actually mounted there, or where it is mounted from. So one would still have to start a cluster to check those things :-/
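For what it's worth, that DBFS-based check would look roughly like this (illustrative only; it only tells you whether anything is listed under the path, not whether or from where it is mounted):

```python
# check_mount_path.py - only checks whether anything is listed under the path,
# not whether it is actually a mount or where it is mounted from
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
TOKEN = os.environ["DATABRICKS_TOKEN"]


def path_has_content(path: str) -> bool:
    resp = requests.get(
        f"{HOST}/api/2.0/dbfs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"path": path},
        timeout=30,
    )
    if resp.status_code == 404:  # path does not exist at all
        return False
    resp.raise_for_status()
    return len(resp.json().get("files", [])) > 0


if __name__ == "__main__":
    print(path_has_content("/mnt/my-mount"))  # placeholder mount point
```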
11-29-2021 12:18 AM
I guess one could say the same for all the SQL metadata objects, for which you also need to have a cluster up and running.
Some just need a cluster up and running to validate and check them; some even rely on a cluster, e.g. for authentication, if you used a mount point with OAuth for example.
08-17-2022 04:22 AM
Hello there, I would like to revive this thread to ask about good practices for data processes in Databricks. I have two cloud accounts, with one Databricks environment in each (one for dev, another for prod).
I was thinking of creating my own CI/CD pipeline to move notebooks from the dev environment to prod and schedule them with GitHub and Azure DevOps, but I would like to see what the community recommends.
I've seen that cicd-templates is deprecated and that dbx is now recommended instead. Is it a tool for that purpose?
11-18-2022 07:12 AM
https://github.com/databrickslabs/cicd-templates is legacy now; the updated tool is dbx.
You need to walk through the docs carefully, as there is a lot of information in there. BTW, `dbx init` can create a project template for you.