Databricks Community

Anonymous · ‎06-07-2021

A number of people have questions on using Databricks in a productionalized environment. What are the best practices to enable CICD automation?

MadelynM · ‎11-08-2021

Check out the Databricks Labs CI/CD Templates. This repository provides a template for automated Databricks CI/CD pipeline creation and deployment.

Table of Contents

View solution in original post

Anonymous · ‎06-07-2021

Databricks enables CICD using REST API. That allows build servers (such as Jenkins, github actions, etc) to update artifacts.

In addition one should use CICD practices to store databricks job config. Use the REST API to update and refresh those in the appropriate environment.

https://docs.databricks.com/dev-tools/api/latest/index.html

gbrueckl · ‎09-02-2021

If you use PowerShell for your CI/CD pipelines you may want to have a look at my DatabricksPS module which is a wrapper for the Databricks REST API.

It has dedicated cmdlets to export (e.g. from DEV) and import again (e.g. to TEST/PROD) in an automated fashion. This includes notebooks, job-definitions, cluster-definitions, secrets, ..., (and SQL objects to come soon!)

https://www.powershellgallery.com/packages/DatabricksPS

pawelmitrus · ‎09-14-2021

Some time ago I've been looking for the very same answers and this is what I found/did back then:

https://menziess.github.io/howto/ - my source of inspiration (tech part)
https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html - based on that I've been trying to set up some process-wise baseline
https://github.com/pawelmitrus/azure-databricks-cicd-notebooks - my own idea of solving the problem of CI/CD for the very common projects, using notebooks for running ETLs (it's kind of a mix of all above)

I'd be happy to discuss

Kristian_Schnei · ‎09-14-2021

I don't know if it's best practice, but perhaps it can serve as inspiration.

We do CI/CD with unit test of pyspark code with github actions. Have a look:

https://github.com/Energinet-DataHub/geh-aggregations#databricks-workspace

https://github.com/Energinet-DataHub/geh-aggregations/actions/workflows/aggregation-job-infra-cd.yml

MadelynM · ‎11-08-2021

Check out the Databricks Labs CI/CD Templates. This repository provides a template for automated Databricks CI/CD pipeline creation and deployment.

Table of Contents

User16859945835 · ‎11-08-2021

Two additional resources come to mind:

If using Jenkins, there's a best practice guide for CI/CD using Jenkins that was written based on numerous successful implementations.
There is an on-demand webinar focused on Getting Workloads to Production from a DevOps and CI/CD perspective.

Erik · ‎11-11-2021

We are using the databricks terraform provider to handle... everything really. Then we use a CI runner (in our case azure pipelines) to deploy to dev/test/prod depending on branches and stuff in git (you might prefer tags/branches whatever your branching strategy is).

It works pretty good, EXCEPT, validating mounts take a long time (10-15 min) because it needs to spin up a cluster. That is pretty lame, and the only fix seems to be for databricks to make a REST API letting you list/modify mounts, but this is nowhere on any list.

alexott · ‎11-25-2021

there is no such REST API (yet)

Erik · ‎11-28-2021

Unfortunately not. Where do I vote to make it happen faster?

Atanu · ‎11-27-2021

We have DBFS APIs, but not sure if that solve your purpse @Erik Parmann . https://docs.databricks.com/dev-tools/api/latest/dbfs.html#list you can check out this.

Erik · ‎11-28-2021

(Note: The mount issue is not mine alone, its a problem for everyone using the terraform databricks provider )

I guess one could use DBFS api to determine if anything was present at the mount-point, but it wont tell you if its because something is actually mounted there, or where it is mounted from. So one would still have to start a cluster to check those things:-/

gbrueckl · ‎11-29-2021

I guess one could say the same for all the SQL meta objects for which you also need to have a cluster up and running

but some just need a cluster up and running to validate and check them

some even rely on a cluster e.g. for authentication - if you used a mount point with OAuth for example

LorenRD · ‎08-17-2022

Hello there, I would like to retake this thread to ask for good practices for Data processes in Databricks. I have 2 cloud accounts with one Databricks env in each one (One for dev another for prod).

I was thinking to create my own CI/CD pipeline to move notebooks from dev env to prod and schedule them with GitHub and Azure DevOps but would like to see what community recommends.

I've seen that ci-cd templates is deprecated and now is recommended to use dbx, is it a tool for that purpose?