Databricks Lakeflow enables data teams to design and operate data pipelines at scale, where speed and reliability directly influence the time to market for insights. As pipeline complexity grows, test automation becomes essential to maintaining data quality and ensuring smooth, predictable production workflows. With Databricks Asset Bundles, setting up CI/CD and automated testing on Databricks has become significantly simpler, empowering teams to build with confidence and deliver value faster.
Different layers of testing, from unit tests on individual transformations to integration tests on complete jobs, play a role in building reliable data pipelines.
In this blog, we focus on integration testing, examine two different approaches, and present a blueprint for one of them.
Choosing the right approach for integration testing your Lakeflow jobs is critical for CI/CD maturity, developer productivity, and confidence when deploying changes to a production environment. Two commonly adopted strategies are Databricks workflow-based integration testing and local integration tests with tools such as Pytest. Let's dive into both approaches to see how they work and weigh their advantages and disadvantages.
In this approach, tests are implemented as Databricks notebooks or Python scripts, orchestrated and scheduled via Databricks Lakeflow Jobs. Tests run within the Databricks cloud environment, accessing clusters, data sources, and job code directly. This method requires developers to create and deploy two jobs for each pipeline: a main ETL job and a dedicated integration testing job. The main job handles the ETL tasks, while the integration testing job orchestrates test notebooks to set up environments, execute main jobs, validate results, and clean up resources.
The approach consists of four phases, each executed as a separate task within the integration test job: setting up the test environment, executing the main ETL job, validating the results, and cleaning up test resources.
Typically, the integration testing job is deployed and run only in specific environments (e.g., test, acceptance) through a CI/CD process. To ensure isolation and environment-specific functionality, the main ETL job must be parameterized, allowing all environment-dependent values (catalogs, schemas, paths) to be configured via notebook parameters. The CI agent is responsible for deploying, executing, and validating the integration test results, thereby determining whether the build succeeds or fails.
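For illustration, a notebook task in the main ETL job could read its environment-dependent values through notebook widgets instead of hard-coding them. The following is a minimal sketch, where the parameter names, the sample source table, and the output table name are hypothetical; dbutils and spark are the objects provided by the notebook runtime.

```python
# Hypothetical notebook task of the main ETL job: all environment-dependent
# values arrive as notebook parameters rather than being hard-coded.
dbutils.widgets.text("catalog", "main")
dbutils.widgets.text("schema", "default")

catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

# Write the output to whatever catalog and schema the caller passed in,
# whether that caller is the integration test job or a production schedule.
df = spark.read.table("samples.nyctaxi.trips")
df.write.mode("overwrite").saveAsTable(f"{catalog}.{schema}.my_table")
```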
However, this approach presents several drawbacks, primarily stemming from the necessity of creating and deploying a secondary job for testing alongside the main job.
Additionally, the approach slows down the feedback loop during both test and job development. Running tests requires deploying both jobs with the right parameters to the right environment, running them, waiting for results, and interpreting those results. This context switching breaks the developer flow.
In this approach, tests are written locally with Pytest and developed in your preferred IDE with full debugging capabilities. Tests leverage Databricks Connect to establish a remote execution context on Databricks compute while orchestrating everything from your local machine or a CI agent. Pytest fixtures set up and tear down resources, the job under test is triggered, and its results are validated afterward through Pytest assertions.
The typical workflow is to deploy the parameterized main job once, then run Pytest locally or on a CI agent: fixtures set up isolated test resources, the job under test is triggered remotely, its results are validated through assertions, and the resources are cleaned up afterward.
Similar to the first approach, the main job under test must be parameterized for all environments, with test-specific values supplied during test execution. This ensures test isolation, as resources are created specifically for each test run. Developers can execute the tests either locally or via a CI agent without deploying additional jobs to Databricks, asserting on job execution results remotely. Pytest also offers test reporting, which helps quickly identify successful and failing tests.
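For a Python wheel task like the one deployed later in this post, those parameters arrive as command-line arguments, so the python_params passed by the test map directly onto sys.argv. Below is a minimal sketch of such an entry point, with the actual ETL logic elided.

```python
# Hypothetical entry point of the job's wheel package: the target catalog and
# schema are the first two positional arguments, matching the python_params
# that the integration test passes when triggering the job.
import sys


def main():
    catalog, schema = sys.argv[1], sys.argv[2]
    target_table = f"{catalog}.{schema}.my_table"
    # ... run the actual ETL logic and write its output to target_table ...
    print(f"Writing results to {target_table}")


if __name__ == "__main__":
    main()
```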
This approach provides the benefits of a fast feedback loop, full IDE debugging support, no additional test jobs to deploy or maintain, and built-in test reporting.
At its core, this blueprint defines a pytest-driven framework to run integration tests on each Lakeflow job (with multiple tasks). Every test case follows a standardized structure focused on reproducibility, isolation, and full verification of pipeline outputs.
Each test execution provides consistent validation against real Databricks runtime behavior, from workflow orchestration to data persistence, while preserving the speed and productivity of local workflows.
The databricks-labs-pytester package is a valuable utility for orchestrating integration tests with Databricks, enhancing pytest by providing native Databricks capabilities for resource management, Spark session handling and test isolation.
Here’s an example integration test showing these concepts in action:
```python
import pytest
from databricks.sdk.service.jobs import RunResultState


@pytest.mark.integration_test
def test_job(spark, make_schema, ws, job_id):
    catalog_name = "main"

    # Ephemeral schema creation
    schema_name = make_schema(catalog_name=catalog_name).name

    # Trigger a Databricks Workflow run
    run_wait = ws.jobs.run_now(
        job_id=job_id,
        python_params=[catalog_name, schema_name],
    )

    # Wait for run completion and validate success
    run_result = run_wait.result()
    result_status = run_result.state.result_state
    assert result_status == RunResultState.SUCCESS

    # Validate result data written to Unity Catalog table
    df = spark.read.table(f"{catalog_name}.{schema_name}.my_table")
    assert df.count() > 0


@pytest.fixture
def job_id(ws, request):
    job_name = request.config.getoption("--job-name")
    job_id = next((job.job_id for job in ws.jobs.list() if job.settings.name == job_name), None)
    if job_id is None:
        raise ValueError(f"Job '{job_name}' not found.")
    return job_id
```
In this test, the make_schema fixture creates an ephemeral Unity Catalog schema that is removed automatically after the test, the ws fixture (a Databricks WorkspaceClient) triggers the job run and waits for its completion, the spark fixture (a Databricks Connect session) reads the output table to validate the results, and the custom job_id fixture resolves the job to test from the --job-name command-line option.
This minimal example demonstrates both Pytester's fixture-driven simplicity and its tight integration with Lakeflow jobs, enabling readable, fully automated tests optimized for CI/CD.
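The --job-name option and the integration_test marker used above are not built into Pytest; one way to register them is in a conftest.py, sketched below (markers can also be declared in pyproject.toml).

```python
# conftest.py


def pytest_addoption(parser):
    # Allows the job under test to be selected from the command line, e.g.
    # pytest -m integration_test --job-name="<job-name-to-test>"
    parser.addoption("--job-name", action="store", help="Name of the deployed Lakeflow job to test")


def pytest_configure(config):
    # Register the custom markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "integration_test: tests that run against a Databricks workspace")
    config.addinivalue_line("markers", "unit_test: tests that run locally against pyspark")
```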
While Pytester manages Databricks resource orchestration and improves the test code structure, Databricks Connect powers the actual Spark execution on any Databricks compute, including serverless compute. Databricks Serverless provides instant, on-demand compute, which is critical for fast test completion within both inner and outer development loops.
Through Databricks Connect, tests execute their Spark code remotely on Databricks compute, classic clusters or serverless, while being orchestrated from the local machine or a CI agent, and they can read and validate the Unity Catalog tables written by the job under test.
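The spark fixture used in the test above is such a Databricks Connect session. Outside of Pytester, an equivalent session can be created explicitly; a minimal sketch, assuming the environment variables shown in the prerequisites below are set:

```python
from databricks.connect import DatabricksSession

# Picks up DATABRICKS_HOST plus either DATABRICKS_CLUSTER_ID or
# DATABRICKS_SERVERLESS_COMPUTE_ID from the environment.
spark = DatabricksSession.builder.getOrCreate()

# DataFrame operations are executed remotely on the configured Databricks compute.
print(spark.range(5).count())
```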
Unit and integration tests depend on different Spark and Databricks runtime contexts. The UV package manager makes it easy to enforce dependency isolation across these layers through pyproject.toml dependency groups.
Example configuration from pyproject.toml:
```toml
[dependency-groups]
dev = [
    "pytest>=8.3.4",
    "databricks-labs-pytester",
]
unit-tests = [
    # These dependencies will break integration tests relying on databricks-connect
    "pyspark>=4.0.0,<5.0.0",
]
integration-tests = [
    "databricks-connect==17.1.0",
]
```
Run integration tests:
```bash
uv sync --only-group integration-tests
uv run python -m pytest -rsx -m integration_test --job-name="<job-name-to-test>"
```
Run unit tests:
```bash
uv sync --only-group unit-tests
uv run python -m pytest -m unit_test
```
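For comparison, a unit test in this setup runs entirely against a local pyspark session and never touches a workspace; a sketch using a hypothetical transformation function:

```python
import pytest
from pyspark.sql import SparkSession, functions as F


# Hypothetical pure transformation as it might live in the job's package.
def average_trip_distance(df):
    return df.agg(F.avg("trip_distance").alias("avg_trip_distance"))


@pytest.mark.unit_test
def test_average_trip_distance():
    # Local Spark session from the pyspark dependency in the unit-tests group.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1.0,), (3.0,)], ["trip_distance"])
    result = average_trip_distance(df).collect()[0]["avg_trip_distance"]
    assert result == 2.0
```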
With isolated dependency groups, data teams can confidently test across multiple layers while maintaining a single, consistent repository. Shared dependencies under the dev group reduce duplication and ensure identical setups both locally and within CI/CD executions.
The following section showcases how this setup will run in an actual Databricks environment. The full demo code is available in the Databricks Blogposts GitHub repository.
Prerequisites: install and set up the following dependencies:
```bash
uv sync --only-group integration-tests

export DATABRICKS_HOST=<your-dev-workspace-url>
export DATABRICKS_CLUSTER_ID=<cluster-id>        # Used to run Spark code on Databricks
export DATABRICKS_WAREHOUSE_ID=<warehouse-id>    # Used by pytester fixtures to run SQL queries

# Optionally, to use serverless instead of classic compute, replace DATABRICKS_CLUSTER_ID with:
export DATABRICKS_SERVERLESS_COMPUTE_ID=auto
```
Set up a Databricks Asset Bundle for the main job, which makes it easy to define and deploy the job to different target environments.
Example job definition:
```yaml
# ./resources/workflow_test_automation_blueprint.job.yml
resources:
  jobs:
    workflow_test_automation_blueprint_job:
      name: workflow_test_automation_blueprint_job
      tasks:
        - task_key: calculate_avg_trip_distance
          python_wheel_task:
            entry_point: main
            package_name: ps_test_blueprint
            parameters: ["main", "default"]
          environment_key: default
      environments:
        - environment_key: default
          spec:
            environment_version: '3'
            dependencies:
              - ../dist/*.whl
```
Example databricks.yml file:

```yaml
bundle:
  name: workflow_test_automation_blueprint

include:
  - resources/*.yml

artifacts:
  default:
    type: whl
    build: uv build --wheel --package ps_test_blueprint
    path: .

targets:
  dev:
    mode: development
    default: true
```
Once the YAML files are defined and configured properly, deploy the main job under test to the Databricks environment where you want to run the tests. In this example, it is deployed to the dev target.
```bash
databricks bundle deploy -t dev
```
Next, execute the integration tests, which will set up the environment through fixtures, trigger the main job that was deployed in the previous step, and validate the results of the job run. This can be run multiple times for the same job deployed in the previous step.
```bash
uv run python -m pytest -rsx -m integration_test --job-name="workflow_test_automation_blueprint_job"
```
Once the test runs, a new test-specific schema is created to store the output tables and is passed to the job as a parameter, and the main job is triggered. The dummy_* schema is created by the Pytester fixture specifically for this test run. When the main job completes successfully, its results are returned to the test for validation, marking the test as a success and producing the Pytest report.
The schema and any resources created specifically for the test are automatically deleted after the test run.
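Pytester's built-in fixtures handle this cleanup for their own resources; any additional test-specific resources can get the same behavior with a yield-based Pytest fixture. A sketch using a hypothetical input table:

```python
import pytest


@pytest.fixture
def input_table(spark, make_schema):
    # Create a throwaway input table in an ephemeral schema for the job to read.
    schema_name = make_schema(catalog_name="main").name
    table = f"main.{schema_name}.input_trips"
    spark.createDataFrame([(1.0,), (3.0,)], ["trip_distance"]).write.saveAsTable(table)
    yield table
    # Teardown runs after the test completes, regardless of its outcome.
    spark.sql(f"DROP TABLE IF EXISTS {table}")
```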
In this blog, we explored two approaches for integration testing of Lakeflow Jobs and presented a practical blueprint for Approach 2 using Pytest and Databricks Connect. By combining Pytest's framework, including advanced fixture management through databricks-labs-pytester, with Databricks Connect's remote execution capabilities, data teams gain faster feedback cycles, higher developer productivity, and more reliable data pipelines. This streamlined workflow empowers teams to test the entire lifecycle, from orchestration and external dependencies to data persistence, directly within their IDE or CI environment.