Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

GitHub CI/CD Best Practices

j_h_robinson
New Contributor II

Using GitHub, what are some best-practice CI/CD approaches to use specifically with the silver and gold medallion layers? We want to create the bronze, silver, and gold layers in Databricks notebooks.

Also, is using notebooks in production a "best practice?"

1 REPLY

mark_ott
Databricks Employee

For Databricks projects using the medallion architecture (bronze, silver, gold layers), effective CI/CD strategies on GitHub include strict version control, environment isolation, automated testing and deployments, and careful notebook management, all tailored to each stage's requirements. Using notebooks in production is possible, but it is not considered best practice for every scenario; a hybrid approach is often preferable.

CI/CD Best Practices for Silver and Gold Layers

  • Organize repositories to clearly separate bronze, silver, and gold layers, using distinct folders or branches for each layer to streamline management and change tracking.

  • Apply a Git branching strategy (feature/development, staging, main/production) to control integration and deployment stages.

  • Store notebooks as source files under Git (for example via Databricks Git folders/Repos or the Databricks CLI) to enable version control, reversion, and collaboration.

  • Automate testing for silver and gold transformations using frameworks suited to Spark (e.g., pytest, chispa), and validate deployment definitions with the Databricks CLI's bundle validate command.

  • Set up GitHub Actions (or an equivalent CI/CD service) to trigger tests and lint checks on every push, validating all changes before merging.

  • Automate deployments to Databricks, triggering layer-specific jobs only after all tests pass, protecting downstream data and business logic.

  • Use parameterization to manage configuration differences across development, staging, and production environments.

  • Maintain documentation and monitor deployments for reliability and performance issues in the silver/gold layers.

  • Employ isolated identities and clusters for each layer to enforce security and prevent accidental or malicious cross-layer access.

  • Bundle jobs, notebooks, and infrastructure as unified assets using Databricks Asset Bundles or Terraform to ensure consistent, repeatable deployments.
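To make the automated-testing bullet concrete, here is a minimal sketch of the idea: keep silver-layer business rules as plain functions so a CI job (e.g. GitHub Actions running pytest) can test them without a cluster. The clean_customer function and its column names are hypothetical, not part of any Databricks API; real silver tests would typically compare Spark DataFrames with a library like chispa.

```python
from typing import Optional


def clean_customer(record: dict) -> Optional[dict]:
    """Hypothetical silver-layer rule: drop rows with no customer id,
    normalize email to lowercase. Pure Python, so it is unit-testable."""
    if not record.get("customer_id"):
        return None  # the real pipeline would quarantine or drop this row
    return {
        "customer_id": record["customer_id"],
        "email": (record.get("email") or "").strip().lower(),
    }


# Tests like these run in CI on every push, before any deployment:
def test_clean_customer_normalizes_email():
    out = clean_customer({"customer_id": "c1", "email": " A@B.COM "})
    assert out == {"customer_id": "c1", "email": "a@b.com"}


def test_clean_customer_drops_missing_id():
    assert clean_customer({"email": "x@y.com"}) is None
```

The same function can then be applied inside a Spark job (for example via a UDF or mapped over a DataFrame), so the logic tested in CI is the logic that runs in production.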

Using Notebooks in Production

  • Notebooks in Databricks can serve as production assets, especially when integrated with CI/CD, code review, and testing practices.

  • Notebooks are valuable for low-barrier rapid prototyping and interactive workflows, and many teams run them in scheduled production jobs.

  • For production-grade software engineering, consider combining notebooks (as orchestrators or thin wrappers) with well-tested libraries maintained as .py files, leveraging the best of both approaches.

  • Some practitioners caution against notebooks for complex production workloads because of challenges with structured testing, pre-commit hooks, code review, and integration into standard CI/CD pipelines, recommending modular Python scripts for maximum reliability and maintainability.

  • Databricks itself documents best practices for making production notebooks robust, emphasizing modular code, integrated version control, automated tests, and reproducibility.
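A minimal sketch of the hybrid pattern described above, with the business logic in a versioned, unit-tested package and the notebook reduced to a thin orchestration layer. The module name my_pipeline.silver, the function apply_silver_rules, and the column names are all hypothetical; a real job would read and write Delta tables via Spark rather than Python lists.

```python
# --- my_pipeline/silver.py (hypothetical package, versioned and tested in CI) ---
def apply_silver_rules(rows, run_date):
    """Pure business logic: keep rows with an order_id and stamp the run date."""
    return [
        {**row, "run_date": run_date}
        for row in rows
        if row.get("order_id") is not None
    ]


# --- Databricks notebook cell (thin wrapper; one import, one call) ---
# from my_pipeline.silver import apply_silver_rules  # wheel installed on the cluster
raw = [{"order_id": 1}, {"order_id": None}, {"order_id": 2}]
silver = apply_silver_rules(raw, run_date="2024-01-01")
# Only the valid rows survive, each stamped with the run date.
```

Because the notebook contains almost no logic, code review and CI focus on the package, while the notebook keeps the interactive, schedulable surface that Databricks jobs expect.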

Key Takeaways

  • Notebooks are widely used in Databricks production, but they work best when supplemented with robust CI/CD workflows and strong testing, just like any other code artifact.

  • For the silver and gold layers, focus on separation of concerns, test validation, and strong security/isolation.

  • Consider a hybrid approach: use notebooks for orchestration and rapid iteration, but anchor complex logic and business rules in maintainable, versioned Python packages or scripts referenced by the notebooks for production stability.

Most production-grade teams blend both approaches, combining the ease of notebooks with the code modularity, testing, and CI/CD rigor of traditional software engineering.