Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Running Spark Tests

maikel
Contributor II

Hello Community!

I'm writing with a question about the best way to run Spark unit tests in Databricks. Currently we have a set of notebooks responsible for operations on the data (joins, merging, etc.).
To avoid keeping everything in the notebooks, we have a separate directory for Python functions, and they very often use Spark code. For now, the only way to test those functions is to mock Spark, but that is not how we would like to keep it, since mocking the outputs means real Spark behavior is skipped. The problem is that to run Spark tests we need a Databricks environment. Another important detail is that we use serverless compute, since we do not want to wait for a cluster to wake up.

Do you have any suggestions for writing good Spark tests and being able to run them in a Databricks environment?

Thanks a lot!

2 REPLIES

lingareddy_Alva
Esteemed Contributor

Hi @maikel 

1. Databricks Connect (Best fit for your situation)
This is likely your best path. It lets you run Spark code locally or in CI against a real Databricks cluster/serverless compute, meaning:

 - Real Spark behavior, no mocking
 - Tests run from your local machine or CI pipeline (GitHub Actions, Azure DevOps, etc.)
 - You write standard pytest tests
 - Serverless compute is supported as of Databricks Connect v2 (DBR 13+)
Your code and tests run locally, but all actual Spark execution happens on Databricks. No mocking, real Delta, real Unity Catalog.
Before writing any tests, verify your connection works:
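A minimal smoke test (assuming databricks-connect is installed and workspace authentication is already configured, e.g. via a Databricks config profile) might look like this:

```python
# Sketch: verify the Databricks Connect session end to end.
# Assumes databricks-connect is installed and auth is configured.
from databricks.connect import DatabricksSession

# serverless(True) targets serverless compute; omit it to use a configured cluster.
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# If this count comes back, Spark execution on Databricks is working.
print(spark.range(10).count())
```

This requires a live workspace, so run it once before wiring the suite into CI.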

2. Nutter (Databricks-native notebook testing)
If your logic is tightly coupled to notebooks, Nutter is a framework by Microsoft specifically for testing Databricks notebooks. It runs notebooks as tests inside the Databricks environment.
Good if you want to test notebook-level behavior, but less clean for pure function unit tests.

Regards,
LR

 


Louis_Frolio
Databricks Employee

Great suggestions, @lingareddy_Alva, regarding Databricks Connect v2!

@maikel ,

A few things to layer on top of that.

First, the fact that you already have your functions in a separate directory outside of notebooks is exactly the right foundation. That separation is what makes real testing possible, so you're ahead of a lot of teams on this.

  1. Separate pure Python from Spark-dependent code

This is the highest-leverage move you can make. In your Python functions directory, look at which functions actually need Spark and which don't. Functions that build filter conditions, transform column names, assemble config, or operate on plain Python types can be tested with plain pytest, with no Databricks needed at all. If some of your functions use Polars, those can also be tested entirely locally.

Then have thin adapter functions that translate to/from DataFrames and call that pure logic. This lets you run fast local unit tests for the bulk of your logic and reserve Databricks-backed tests for the parts that truly depend on Spark (joins on DataFrames, Delta reads/writes, Unity Catalog integration).
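As a sketch of that split, with hypothetical names (is_active_record, filter_active): the business rule is pure Python, and the Spark adapter is a thin wrapper that calls it.

```python
# Sketch of the pure-logic / Spark-adapter split (names are hypothetical).

def is_active_record(status: str, retry_count: int) -> bool:
    """Pure business rule: testable with plain pytest, no Spark needed."""
    return status == "active" and retry_count < 3

def filter_active(df):
    """Thin Spark adapter around the pure rule. pyspark is imported lazily,
    so importing this module for the pure tests requires no Spark install."""
    from pyspark.sql import functions as F, types as T
    active = F.udf(is_active_record, T.BooleanType())
    return df.filter(active(F.col("status"), F.col("retry_count")))
```

For a predicate this simple, a native column expression mirroring the rule would be faster than a UDF; the point is that the rule itself has one pure, locally testable home.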

  2. Databricks Connect v2 + pytest for Spark integration tests

This is the core recommendation, and it fits your serverless constraint well. With Databricks Connect v2 (DBR 13+), your tests run locally or in CI, but all Spark execution happens on Databricks serverless. Real optimizer, real shuffles, real Delta, real Unity Catalog, no mocking.

A common pattern is a session-scoped pytest fixture in conftest.py:

 
 
# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Plain SparkSession shown here; with Databricks Connect v2 you would
    # build the session via DatabricksSession from databricks.connect instead.
    spark = SparkSession.builder.getOrCreate()
    yield spark
    spark.stop()

Then your tests create small real DataFrames and assert on outputs:

 
 
def test_join_logic(spark):
    left = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "val"])
    right = spark.createDataFrame([("a", 10)], ["key", "val2"])

    result = my_join_fn(left, right)

    rows = {tuple(r) for r in result.collect()}
    assert rows == {("a", 1, 10), ("b", 2, None)}

  3. Running tests inside Databricks

Two paths here depending on your setup:

Path A (lower barrier): If your code and tests live in Databricks Repos, you can run pytest directly on compute. Use %pip install pytest in a notebook, then !python -m pytest. This gives you real Spark on serverless with zero local setup.
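For Path A, a notebook cell can also invoke pytest programmatically; pytest.main returns an exit code, with 0 meaning every collected test passed. The throwaway test file below only keeps the sketch self-contained; in practice you would point pytest at your repo's tests/ directory.

```python
# Sketch: running pytest from a Databricks notebook cell
# (after %pip install pytest).
import pathlib
import tempfile

import pytest

# Hypothetical throwaway test file, just to make the example runnable.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "test_smoke.py").write_text("def test_ok():\n    assert 1 + 1 == 2\n")

# pytest.main returns an exit code: 0 means all tests passed.
retcode = pytest.main([str(tmp), "-q"])
```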

Path B (CI/CD ready): Package your Python code and tests as a wheel. Create a Job on serverless compute that runs pytest as the entry point. Wire this into Databricks Asset Bundles (DABs) so your CI pipeline can deploy the wheel, trigger the test job, and gate promotion on test results. This is the more production-grade path and gives you repeatable "run all Spark tests in Databricks env" as part of your pipeline.

 

  4. Where Nutter fits

Use Nutter only when you need to test whole notebooks: widgets, dbutils calls, orchestration between cells. For testing the Python functions in your separate directory, pytest + Databricks Connect is the cleaner path.

The ideal end state for your notebooks is thin orchestration layers that read parameters, load inputs, call your tested functions, and write outputs. The heavy logic lives in your Python modules where pytest can reach it.

 

  5. Putting it together

A setup that fits your constraints (serverless, real Spark, minimal mocking):

  • Unit tests (fast, local): pure Python logic and any Polars functions. Plain pytest, no Databricks.
  • Spark integration tests (local runner pointing to Databricks serverless): pytest + Databricks Connect, creating small real DataFrames.
  • Optional notebook E2E tests (in workspace): Nutter on serverless Jobs compute for the few cases where you need to test notebook-level behavior.

This keeps test feedback fast, stays close to real Spark behavior, and fully leverages the serverless environment you already prefer.

Hope that helps, Maikel.

Cheers, Lou