
What are some best practices for CI/CD?

Anonymous
Not applicable

A number of people have questions about using Databricks in a production environment. What are the best practices for enabling CI/CD automation?


Anonymous
Not applicable

Databricks enables CI/CD through its REST API, which allows build servers (such as Jenkins or GitHub Actions) to update artifacts.

In addition, you should keep your Databricks job configurations in source control as part of your CI/CD practice, and use the REST API to update and refresh them in the appropriate environment.

https://docs.databricks.com/dev-tools/api/latest/index.html
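
A minimal sketch of that approach, assuming the Jobs 2.1 API and a job definition kept in source control (the host/token environment variables, the jobs/my_job.json path, and the job ID are all placeholders):

```python
# Minimal sketch: a build server pushes a version-controlled job definition
# to a workspace with the Jobs API (jobs/reset replaces the job's settings).
# DATABRICKS_HOST, DATABRICKS_TOKEN, jobs/my_job.json and the job ID are
# placeholders for this example.
import json
import os

import requests

host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

# Job settings kept in source control alongside the code they run
with open("jobs/my_job.json") as f:
    new_settings = json.load(f)

resp = requests.post(
    f"{host}/api/2.1/jobs/reset",
    headers={"Authorization": f"Bearer {token}"},
    json={"job_id": 123, "new_settings": new_settings},  # illustrative job ID
)
resp.raise_for_status()
```

A build server such as Jenkins or GitHub Actions would run this step whenever the job JSON in the repository changes.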

gbrueckl
Contributor II

If you use PowerShell for your CI/CD pipelines, you may want to have a look at my DatabricksPS module, which is a wrapper for the Databricks REST API.

It has dedicated cmdlets to export (e.g. from DEV) and import again (e.g. to TEST/PROD) in an automated fashion. This includes notebooks, job definitions, cluster definitions, secrets, ... (and SQL objects are coming soon!)

https://www.powershellgallery.com/packages/DatabricksPS

pawelmitrus
New Contributor III

Some time ago I was looking for the very same answers. I'd be happy to discuss what I found and did back then.

Kristian_Schnei
New Contributor II

I don't know if it's best practice, but perhaps it can serve as inspiration.

We do CI/CD with unit tests of PySpark code using GitHub Actions. Have a look:

https://github.com/Energinet-DataHub/geh-aggregations#databricks-workspace

https://github.com/Energinet-DataHub/geh-aggregations/actions/workflows/aggregation-job-infra-cd.yml
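
For illustration, a hedged sketch of what such a PySpark unit test can look like; the add_vat transformation and its numbers are invented for this example, not taken from the repositories above. A plain GitHub Actions runner can execute it with a local Spark session:

```python
# Hedged sketch of a PySpark unit test runnable on a plain CI runner;
# add_vat and the 25% rate are invented for this example.
import pytest
import pyspark.sql.functions as F
from pyspark.sql import SparkSession


def add_vat(df, rate=0.25):
    """Transformation under test: add a gross-amount column."""
    return df.withColumn("gross", F.col("net") * (1 + rate))


@pytest.fixture(scope="session")
def spark():
    # Local single-threaded session; no Databricks workspace needed
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_vat(spark):
    df = spark.createDataFrame([(100.0,)], ["net"])
    assert add_vat(df).first()["gross"] == pytest.approx(125.0)
```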

User16859945835
New Contributor II

Two additional resources come to mind:

  1. If using Jenkins, there's a best practice guide for CI/CD using Jenkins that was written based on numerous successful implementations.
  2. There is an on-demand webinar focused on Getting Workloads to Production from a DevOps and CI/CD perspective.

alexott
Valued Contributor II

There is no such REST API (yet).

Atanu
Esteemed Contributor

We have DBFS APIs, but I'm not sure whether that solves your purpose, @Erik Parmann. You can check out https://docs.databricks.com/dev-tools/api/latest/dbfs.html#list.
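
For reference, a small sketch of calling that List endpoint (the host, token, and /FileStore path are placeholders):

```python
# Small sketch of the DBFS List API call linked above; host, token and the
# /FileStore path are placeholders.
import os

import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(
    f"{host}/api/2.0/dbfs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/FileStore"},
)
resp.raise_for_status()
for entry in resp.json().get("files", []):
    print(entry["path"], "(dir)" if entry["is_dir"] else entry["file_size"])
```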

gbrueckl
Contributor II

I guess one could say the same for all the SQL meta objects, for which you also need to have a cluster up and running.

But some objects just need a cluster up and running so they can be validated and checked, while others even rely on a cluster for authentication, e.g. if you used a mount point with OAuth.

LorenRD
Contributor

Hello there, I'd like to revive this thread to ask about good practices for data processes in Databricks. I have two cloud accounts, with one Databricks environment in each (one for dev, another for prod).

I was thinking of building my own CI/CD pipeline with GitHub and Azure DevOps to move notebooks from the dev environment to prod and schedule them, but I'd like to see what the community recommends.

I've seen that cicd-templates is deprecated and that dbx is now recommended instead; is it a tool for that purpose?
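
For what it's worth, a hedged sketch of that dev-to-prod notebook promotion using the Workspace API; the hosts, tokens, and notebook path are placeholders, the notebook is assumed to be Python, and a real pipeline would loop over a directory listing rather than a single path:

```python
# Hedged sketch of promoting one notebook from a dev workspace to prod with
# the Workspace API. Hosts, tokens and the notebook path are placeholders,
# and the notebook is assumed to be Python (SOURCE format requires language).
import os

import requests

dev = {"host": os.environ["DEV_HOST"], "token": os.environ["DEV_TOKEN"]}
prod = {"host": os.environ["PROD_HOST"], "token": os.environ["PROD_TOKEN"]}
path = "/Shared/etl/daily_load"  # illustrative notebook path

# Export from dev; the response carries the notebook base64-encoded
exp = requests.get(
    f"{dev['host']}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {dev['token']}"},
    params={"path": path, "format": "SOURCE"},
)
exp.raise_for_status()

# Import into prod, overwriting any previous version
imp = requests.post(
    f"{prod['host']}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {prod['token']}"},
    json={
        "path": path,
        "format": "SOURCE",
        "language": "PYTHON",  # assumed notebook language
        "content": exp.json()["content"],
        "overwrite": True,
    },
)
imp.raise_for_status()
```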

xiangzhu
Contributor

https://github.com/databrickslabs/cicd-templates is legacy now; the updated tool is dbx.

You need to walk through the documentation carefully, as there is a lot of information in it. By the way, `dbx init` can create a project template.
