Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are some best practices for CICD?

Anonymous
Not applicable

A number of people have questions about using Databricks in a production environment. What are the best practices for enabling CI/CD automation?

1 ACCEPTED SOLUTION

MadelynM
New Contributor III
15 REPLIES

Anonymous
Not applicable

Databricks enables CI/CD through its REST API, which allows build servers (such as Jenkins or GitHub Actions) to update artifacts.

In addition, store your Databricks job configuration in version control as part of your CI/CD practice, and use the REST API to update and refresh those jobs in the appropriate environment.

https://docs.databricks.com/dev-tools/api/latest/index.html
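As a rough illustration of the "update job config via the REST API" idea, here is a minimal Python sketch that builds a request for the Jobs API 2.1 `jobs/reset` endpoint, which overwrites the settings of an existing job. The host, token, job ID, and settings shown are hypothetical placeholders; a real pipeline would presumably read the settings from a JSON file kept in git and the token from the build server's secret store.

```python
import json
import urllib.request


def build_job_reset_request(host: str, token: str, job_id: int, settings: dict) -> urllib.request.Request:
    """Build (but do not send) a Jobs API 2.1 'reset' request for an existing job."""
    body = json.dumps({"job_id": job_id, "new_settings": settings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/reset",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_job_reset_request(
        host="https://example.cloud.databricks.com",
        token="dapi-REDACTED",
        job_id=123,
        settings={"name": "nightly-etl", "max_concurrent_runs": 1},
    )
    print(req.get_method(), req.full_url)
    # To actually apply the change: urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to unit test in the build without touching a workspace.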

gbrueckl
Contributor II

If you use PowerShell for your CI/CD pipelines you may want to have a look at my DatabricksPS module which is a wrapper for the Databricks REST API.

It has dedicated cmdlets to export objects (e.g. from DEV) and import them again (e.g. to TEST/PROD) in an automated fashion. This includes notebooks, job definitions, cluster definitions, secrets, and more (SQL objects are coming soon!).

https://www.powershellgallery.com/packages/DatabricksPS
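For readers who are not on PowerShell, the same export-then-import round trip can be sketched against the Workspace REST API directly. This is not how DatabricksPS is implemented (I am not claiming anything about its internals); it is only a minimal Python illustration of the two payloads involved, with hypothetical notebook paths. The export endpoint (`GET /api/2.0/workspace/export`) returns the notebook as base64 in a `content` field, which can be passed on unchanged to the import endpoint (`POST /api/2.0/workspace/import`).

```python
import base64


def export_params(path: str) -> dict:
    """Query parameters for GET /api/2.0/workspace/export (source format)."""
    return {"path": path, "format": "SOURCE"}


def import_payload(export_response: dict, target_path: str, language: str = "PYTHON") -> dict:
    """Turn an export response into a JSON body for POST /api/2.0/workspace/import."""
    return {
        "path": target_path,
        "format": "SOURCE",
        "language": language,
        # The export call already returns base64, so it is forwarded as-is.
        "content": export_response["content"],
        "overwrite": True,
    }


if __name__ == "__main__":
    # Simulated export response from the DEV workspace (hypothetical notebook).
    fake_export = {"content": base64.b64encode(b"print('hello')").decode("ascii")}
    print(export_params("/Shared/etl/load"))
    print(import_payload(fake_export, "/Shared/etl/load"))
```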

pawelmitrus
Contributor

Some time ago I was looking for the very same answers, and this is what I found/did back then:

I'd be happy to discuss

Kristian_Schnei
New Contributor II

I don't know if it's best practice, but perhaps it can serve as inspiration.

We do CI/CD, including unit tests of our PySpark code, with GitHub Actions. Have a look:

https://github.com/Energinet-DataHub/geh-aggregations#databricks-workspace

https://github.com/Energinet-DataHub/geh-aggregations/actions/workflows/aggregation-job-infra-cd.yml

MadelynM
New Contributor III

User16859945835
New Contributor II

Two additional resources come to mind:

  1. If using Jenkins, there's a best practice guide for CI/CD using Jenkins that was written based on numerous successful implementations.
  2. There is an on-demand webinar focused on Getting Workloads to Production from a DevOps and CI/CD perspective.

Erik
Valued Contributor II

We are using the Databricks Terraform provider to handle... everything, really. Then we use a CI runner (in our case Azure Pipelines) to deploy to dev/test/prod depending on branches in git (you might prefer tags or branches, whatever fits your branching strategy).
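To make the "deploy depending on branch" idea concrete, here is a small Python sketch of the kind of mapping a CI step could apply before running Terraform. The branch names and the mapping itself are purely hypothetical and not our actual setup; adapt them to your own branching strategy.

```python
from typing import Optional


def target_environment(branch: str) -> Optional[str]:
    """Map a git branch name to a deployment environment (hypothetical scheme)."""
    if branch == "main":
        return "prod"
    if branch.startswith("release/"):
        return "test"
    if branch == "develop":
        return "dev"
    # Feature branches and anything else are not deployed.
    return None


if __name__ == "__main__":
    for name in ("main", "release/1.2", "develop", "feature/foo"):
        print(name, "->", target_environment(name))
```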

It works pretty well, except that validating mounts takes a long time (10-15 min) because it needs to spin up a cluster. That is pretty lame, and the only fix seems to be for Databricks to provide a REST API for listing/modifying mounts, but this is nowhere on any list.

alexott
Valued Contributor II

There is no such REST API (yet).

Erik
Valued Contributor II

Unfortunately not. Where do I vote to make it happen faster?

Atanu
Esteemed Contributor

We have DBFS APIs, but I'm not sure if they serve your purpose, @Erik Parmann. You can check out https://docs.databricks.com/dev-tools/api/latest/dbfs.html#list.

Erik
Valued Contributor II

(Note: The mount issue is not mine alone; it's a problem for everyone using the Terraform Databricks provider.)

I guess one could use the DBFS API to determine whether anything is present at the mount point, but it won't tell you whether that is because something is actually mounted there, or where it is mounted from. So one would still have to start a cluster to check those things. :-/
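A minimal Python sketch of that check, assuming the DBFS `list` endpoint (`GET /api/2.0/dbfs/list`) and a hypothetical mount path. As said above, a non-empty listing only shows that something is there, not that it is a mount or what it points to.

```python
import urllib.parse


def dbfs_list_url(host: str, path: str) -> str:
    """Build the URL for GET /api/2.0/dbfs/list for a given DBFS path."""
    query = urllib.parse.urlencode({"path": path})
    return f"{host.rstrip('/')}/api/2.0/dbfs/list?{query}"


def has_entries(list_response: dict) -> bool:
    """True if the parsed list response contains at least one file or directory."""
    return bool(list_response.get("files"))


if __name__ == "__main__":
    # Hypothetical host and mount point; the response is a hand-written sample.
    print(dbfs_list_url("https://example.cloud.databricks.com", "/mnt/raw"))
    sample = {"files": [{"path": "/mnt/raw/2021", "is_dir": True, "file_size": 0}]}
    print(has_entries(sample))
```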

gbrueckl
Contributor II

I guess one could say the same for all the SQL meta objects, for which you also need to have a cluster up and running.

Some just need a cluster up and running to validate and check them.

Some even rely on a cluster for authentication, for example if you use a mount point with OAuth.

LorenRD
Contributor

Hello there, I would like to revive this thread to ask for good practices for data processes in Databricks. I have two cloud accounts with one Databricks environment in each (one for dev, another for prod).

I was thinking of creating my own CI/CD pipeline with GitHub and Azure DevOps to move notebooks from the dev environment to prod and schedule them, but I would like to see what the community recommends.

I've seen that cicd-templates is deprecated and that dbx is now recommended instead. Is it a tool for that purpose?

xiangzhu
Contributor II

https://github.com/databrickslabs/cicd-templates is legacy now; its updated replacement is dbx.

You need to walk through the docs carefully, as there is a lot of information in them. By the way, `dbx init` can create a template.
