I am part of a small team of data engineers that started using Databricks Asset Bundles a year ago. Our code base consists of typical ETL workloads, written primarily as Jupyter notebooks (.ipynb) and job definitions (.yaml), and spans a large number of different business domains.
Currently, our code base is a single monorepo with one large bundle containing all of our notebooks, jobs, libraries, etc.
It has grown to the point where we see the need to split that single bundle into several smaller bundles - one per business domain.
We are envisioning a setup similar to the following (simplified) structure:
monorepo/
│
├── shared_notebooks/
├── shared_libraries/
├── variables.yml
│
├── Bundle_A/
│   ├── resources/
│   ├── src/
│   └── databricks.yml
│
└── Bundle_B/
    ├── resources/
    ├── src/
    └── databricks.yml
Here the repo contains shared notebooks, libraries, and variables that may be used by any bundle in the repository.
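To make that concrete, this is roughly what we imagine Bundle_A's `databricks.yml` looking like (an illustrative sketch only - the bundle name, target, and the `../variables.yml` include are our assumptions, and whether an include may point above the bundle root is exactly what we are unsure about):

```yaml
# Bundle_A/databricks.yml (sketch - names and paths are illustrative)
bundle:
  name: bundle_a

include:
  # Bundle-local resource definitions
  - resources/*.yml
  # Shared variables one level above the bundle root -
  # this is the kind of "import" we don't know how to do
  - ../variables.yml

targets:
  dev:
    default: true
```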
Does anyone have suggestions for how this could be implemented?
- How can we "import" shared assets (notebooks, libraries, and variables) into our bundles? (A sketch of what we mean follows this list.)
- Does our approach to splitting up our mono-bundle repository seem sensible?
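For the first question, the kind of thing we would like to express in a bundle's job resources is something like the following (again just a sketch with made-up names - the relative path climbing out of the bundle root into `shared_notebooks/` is the part we don't know is possible):

```yaml
# Bundle_A/resources/example_job.yml (sketch - names and paths are illustrative)
resources:
  jobs:
    example_job:
      name: example_job
      tasks:
        - task_key: run_shared_notebook
          notebook_task:
            # Relative to this file, ../../shared_notebooks/ is the shared
            # folder at the repo root - can a task reference a notebook there?
            notebook_path: ../../shared_notebooks/common_ingest
```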
Thanks in advance for your insights!
Kaspar Hauser