Data Engineering

Asset Bundles - path is not contained in bundle root path

kamilmuszynski
New Contributor II

I'm trying to adapt a code base to use asset bundles. I was trying to come up with a folder structure that would work for our bundles and arrived at the layout below:


common/                  (source code)
services/                (source code)
dist/                    (artifacts from the monorepo are built here; I can't change this)
db-asset-bundles/
  data-pipeline/
    integration/
      databricks.yaml
    production/
      databricks.yaml
    resources/
      variables.yaml
      artifacts.yaml

I'd like the integration and production bundles to share some common configuration. I've discovered that I can include '../resources/variables.yaml' from both integration/databricks.yaml and production/databricks.yaml, but including '../resources/artifacts.yaml' results in:

Error: path (...redacted...)/db-asset-bundles/data-pipeline/resources is not contained in bundle root path

Are there any rules about what can be included from databricks.yaml? Does an included file have to be in the same folder as databricks.yaml, or below it?
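For reference, here's roughly what the working include looks like in my integration/databricks.yaml (a minimal sketch; the bundle name is made up):

bundle:
  name: data-pipeline-integration

include:
  - ../resources/variables.yaml    # this include works
  # - ../resources/artifacts.yaml  # this one fails with the "not contained in bundle root path" error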

The same problem happens when I try to include wheels built into the /dist directory at the root of the monorepo: I can't reference them from databricks.yaml, since that would require a path like '../../../dist/[wheel-name]', and that results in the same error about the wheel not being contained in the bundle root. So far I've worked around this by defining the artifact in production/databricks.yaml as:

artifacts:
  pipeline-wheel:
    type: whl
    build: "pants package <path to wheel definition inside services> && mkdir dist && cp ../../../dist/<wheel file> dist/<wheel file>"
    # we use the pantsbuild.org build system for Python, which manages wheel packaging, but all artifacts end up in the /dist dir at the root level...
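In case anyone copies this: a slightly hardened variant of the same workaround, with mkdir -p so re-runs don't fail, plus an explicit files mapping pointing at the copied wheel. The placeholders are the same as above, and the files key is my reading of the artifacts spec, so treat it as a sketch:

artifacts:
  pipeline-wheel:
    type: whl
    # copy the built wheel from the monorepo-level /dist into a dist/ folder
    # inside the bundle root, so the root-path check passes
    build: "pants package <path to wheel definition inside services> && mkdir -p dist && cp ../../../dist/<wheel file> dist/<wheel file>"
    files:
      - source: dist/<wheel file>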

Are there any ways around this that I'm missing?

Thanks a lot!

3 REPLIES

AlbertoLogs
New Contributor II

@kamilmuszynski – Did you figure it out already?

PabloCSD
Contributor II

When I've worked with Databricks Asset Bundles (DABs), I've kept just one databricks.yaml file, placed at the root of the project.

I also made a simple, functional DAB project; its file structure looks like this, in case it helps you:

dab_test_repo/
├── conf/
│   └── tasks/
│       ├── input_task_config.yml
│       ├── process_task_config.yml
│       └── output_task_config.yml
├── dab_test_repo/
│   ├── tasks/
│   │   ├── __init__.py
│   │   ├── input.py
│   │   ├── process.py
│   │   └── output.py
│   ├── __init__.py
│   └── common.py
├── tests/
│   ├── unit/
│   │   ├── tasks/
│   │   │   ├── __init__.py
│   │   │   ├── test_input.py
│   │   │   ├── test_process.py
│   │   │   └── test_output.py
│   │   ├── __init__.py
│   │   └── conftest.py
│   └── __init__.py
├── dist/
│   └── dab_test_repo-0.1.0-py3-none-any.whl
├── .gitignore
├── .pre-commit-config.yaml
├── README.md
├── databricks.yml
└── pyproject.toml

I haven't tried using multiple databricks.yml files, but in my single databricks.yml I have configurations for deploying both the integration and production pipelines.
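Roughly, that part of my databricks.yml looks like this (the hosts are placeholders and the build command is just an example, so adapt it to your setup):

bundle:
  name: dab_test_repo

artifacts:
  default:
    type: whl
    build: python -m build --wheel   # produces dist/dab_test_repo-0.1.0-py3-none-any.whl
    path: .

targets:
  integration:
    mode: development
    workspace:
      host: https://<integration-workspace>.cloud.databricks.com
  production:
    mode: production
    workspace:
      host: https://<production-workspace>.cloud.databricks.com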

kamilmuszynski
New Contributor II

Thanks for the suggestion.

What I ended up doing was to keep a separate directory with a databricks.yaml per pipeline, but with each file defining all targets (dev, int, prod), roughly as in the sketch below. I think a top-level databricks.yaml with proper excludes per target would also work - I need to give it a try at some point 🙂
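A rough sketch of what each per-pipeline file looks like now (the structure follows my layout above; the names and modes are illustrative):

db-asset-bundles/data-pipeline/databricks.yaml:

bundle:
  name: data-pipeline

include:
  - resources/*.yaml   # resources/ is now inside the bundle root, so this include is allowed

targets:
  dev:
    mode: development
  int:
    mode: development
  prod:
    mode: production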
