Noopur_Nigam
Databricks Employee

Unit testing is a vital part of software development, ensuring that individual components of your code function as intended. However, when working with PySpark in Databricks, many teams find themselves skipping or minimising unit tests due to a range of unique challenges. Spark's distributed nature, the interactive workflow of Databricks notebooks, and the integration with Databricks-specific tools can make traditional testing approaches feel cumbersome or out of place.

Common issues include difficulty creating isolated test environments, challenges in mocking Spark dataframes and Databricks utilities, and uncertainty about how to automate tests within the Databricks ecosystem. As a result, teams may rely heavily on manual testing or integration tests, which are slower, harder to maintain, and more likely to let bugs slip through.

Yet, skipping unit tests comes at a cost: without them, bugs in data transformations or business logic can make their way into production, leading to costly data errors, pipeline failures, or customer impact. Well-designed unit tests catch issues early, accelerate development, and provide confidence during refactoring or scaling.

This blog outlines practical strategies for writing unit tests for PySpark applications in Databricks. We’ll discuss common pitfalls, share actionable tips, and highlight best practices to help you build a robust testing culture, ensuring your Spark code is reliable, maintainable, and production-ready, even within the unique context of Databricks.

Why Unit Testing in Databricks is Different

Databricks provides a managed platform for big data processing and machine learning that leverages Apache Spark. However, testing PySpark code within Databricks comes with unique challenges:

  • Runtime-Specific Libraries: Code often relies on Databricks utilities like dbutils, which are unavailable outside the Databricks environment.
  • Global SparkSession: The SparkSession provided by Databricks is automatically initialised and may not be accessible outside its runtime.
  • Notebook-Based Workflows: Many workflows are written in notebooks, which can complicate modular testing.

Despite these challenges, unit testing remains essential to ensure code reliability and maintainability.

Strategies for Writing Unit Tests in Databricks

1. Refactor Code for Better Testability

To make your PySpark code easier to test:

  • Extract Transformation Logic: Move data processing logic into standalone Python functions or modules.
  • Minimise Direct Dependencies on dbutils: Use dependency injection or mock objects to replace dbutils calls during testing (a short sketch follows the example below).
  • Control Notebook Execution: Wrap main execution logic in if __name__ == "__main__" blocks to prevent it from running during imports.

Example:

# databricks_notebook.py

from pyspark.sql import functions as F

def process_data(df):
    # Keep only rows with a birthDate on or after 2000-01-01
    return df.select("name", "birthDate").filter(F.col("birthDate") >= F.lit("2000-01-01"))

if __name__ == "__main__":
    # Main notebook logic
    uc_volume_path = "/Volumes/my_catalog/my_schema/my_volume/my_data"
    dbutils.fs.mkdirs(uc_volume_path)

By isolating the process_data function, you can test it independently without invoking the notebook's runtime-specific code.
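
To go a step further with the dependency-injection point above, you can pass dbutils into your functions instead of referencing it globally. The sketch below assumes a hypothetical helpers.py module and an ensure_output_dir function that are not part of the original notebook; they only illustrate the idea.

# helpers.py (hypothetical module)

def ensure_output_dir(path, dbutils):
    # Create the target directory using whichever dbutils handle is passed in
    dbutils.fs.mkdirs(path)

# In the notebook, pass the real dbutils provided by the Databricks runtime:
# ensure_output_dir("/Volumes/my_catalog/my_schema/my_volume/my_data", dbutils)

Because the function receives dbutils as an argument, a test can hand it a mock object instead (see the mocking section below).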

2. Use Pytest with Fixtures

Pytest offers a flexible framework for writing tests. It allows you to use fixtures to set up reusable resources like a SparkSession.

Example:

import pytest
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from databricks_notebook import process_data

@pytest.fixture(scope="session")
def spark_session():
    return SparkSession.builder.master("local[*]").appName("PyTest").getOrCreate()

def test_process_data(spark_session):
    data = [("Alpha", "2000-01-01"), ("Beta", "1980-05-12")]
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("birthDate", StringType(), True)
    ])
    df = spark_session.createDataFrame(data, schema)
    result_df = process_data(df)

    assert result_df.count() == 1

This approach ensures modularity and allows you to reuse the same SparkSession across multiple test cases.
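
If your tests are spread across several files, one option is to move the fixture into a conftest.py so that every test module picks it up automatically. This is a minimal sketch assuming a standard pytest project layout:

# conftest.py

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    # Created once per test run and shared by all test files in this directory
    return SparkSession.builder.master("local[*]").appName("PyTest").getOrCreate()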

3. Organise Code Using Databricks Repos

Databricks Repos enable better organisation of your code by storing PySpark functions in .py files within a repository. Using Databricks Asset Bundles (DABs), you can define and deploy resources programmatically for CI/CD workflows with simple YAML configurations. Tests can be written in separate files (e.g., test_functions.py) using frameworks like pytest and executed directly in Databricks Notebooks.

Steps to Run Tests in Databricks Notebooks:

Install pytest:

%pip install pytest

Run tests:

import pytest

retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])

assert retcode == 0, "Some tests failed!"

This setup integrates seamlessly with Databricks Repos, DABs, version control systems, and CI/CD pipelines, enhancing code modularity and reliability.

4. Validate DataFrames Using Equality Checks

Starting from Apache Spark 3.5 (and Databricks Runtime 14.2), built-in methods like assertDataFrameEqual and assertSchemaEqual simplify DataFrame validation.

Example:

from pyspark.testing.utils import assertDataFrameEqual

def test_dataframe_equality(spark_session):
    df1 = spark_session.createDataFrame([("Alpha", 20)], ["name", "age"])
    df2 = spark_session.createDataFrame([("Alpha", 20)], ["name", "age"])

    assertDataFrameEqual(df1, df2)  # Passes if both DataFrames are identical

These methods are particularly useful for verifying complex transformations or schema changes.
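
For example, a minimal sketch using assertSchemaEqual, reusing the spark_session fixture from the earlier example, might look like this:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.testing.utils import assertSchemaEqual

def test_filter_preserves_schema(spark_session):
    expected_schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ])
    df = spark_session.createDataFrame([("Alpha", 20), ("Beta", 45)], expected_schema)

    filtered = df.filter(F.col("age") >= 30)

    # The filter should not change the schema
    assertSchemaEqual(filtered.schema, expected_schema)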

5. Mocking Databricks-Specific Libraries

Databricks utilities like dbutils can be mocked using Python's unittest.mock module to simulate their behaviour during testing.

Example:

from unittest.mock import MagicMock

def test_dbutils_interaction():
    mock_dbutils = MagicMock()
    mock_dbutils.fs.mkdirs.return_value = None

    # Simulate function call
    uc_volume_path = "/Volumes/my_catalog/my_schema/my_volume/my_data"
    mock_dbutils.fs.mkdirs(uc_volume_path)

    mock_dbutils.fs.mkdirs.assert_called_once_with(uc_volume_path)

This ensures that your tests remain independent of the actual Databricks runtime environment.
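
If you combine this with the dependency-injection sketch from earlier (the hypothetical ensure_output_dir helper), the mock can be passed straight into the function under test so that the real logic is exercised:

from unittest.mock import MagicMock

from helpers import ensure_output_dir  # hypothetical module from the earlier sketch

def test_ensure_output_dir_creates_directory():
    mock_dbutils = MagicMock()
    path = "/Volumes/my_catalog/my_schema/my_volume/my_data"

    ensure_output_dir(path, mock_dbutils)

    # Verify the injected dbutils was asked to create the directory
    mock_dbutils.fs.mkdirs.assert_called_once_with(path)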

Best Practices for Unit Testing PySpark Code in Databricks

  1. Isolate Business Logic: Keep transformation logic separate from runtime-specific operations like I/O or utility calls.
  2. Use Synthetic Data for Testing: Create small sample datasets within your test cases instead of relying on production data (see the sketch after this list).
  3. Optimise SparkSession Usage: Share a single SparkSession across tests to reduce initialisation overhead.
  4. Integrate Testing into CI/CD Pipelines: Automate testing using tools like GitHub Actions or Azure DevOps.
  5. Test Locally Before Deployment: Validate your code locally before running it in the Databricks environment.
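
As a minimal sketch of the second practice (the rows and expected counts here are made-up synthetic values, and the spark_session fixture from earlier is assumed), pytest.mark.parametrize makes it easy to exercise process_data against several small, hand-crafted datasets:

import pytest

from databricks_notebook import process_data

@pytest.mark.parametrize("rows, expected_count", [
    ([("Alpha", "2000-01-01")], 1),                          # born on the cutoff date
    ([("Beta", "1980-05-12")], 0),                           # born before the cutoff
    ([("Alpha", "2000-01-01"), ("Beta", "1980-05-12")], 1),  # mixed
])
def test_process_data_with_synthetic_rows(spark_session, rows, expected_count):
    df = spark_session.createDataFrame(rows, ["name", "birthDate"])
    assert process_data(df).count() == expected_count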

Conclusion

Writing unit tests for PySpark applications in Databricks requires thoughtful structuring of code and careful handling of runtime dependencies. By leveraging tools like pytest, built-in DataFrame comparison methods, and mocking libraries, you can create reliable tests that ensure your data pipelines are robust and error-free. These strategies will help you build scalable applications on the Databricks Platform while minimising bugs and improving code quality.

5 Comments
TheOC
Contributor

Great article - thanks for the knowledge share!
Out of curiosity, is it common to measure performance (e.g. execution/processing times) as part of unit tests? Or is that better handled in a separate testing / reporting function usually?

Noopur_Nigam
Databricks Employee

Thanks @TheOC. To answer your questions, performance metrics such as execution or processing times are generally not included in unit tests, especially in the context of Databricks and PySpark. The primary goal of unit testing is to ensure that individual components of your code function correctly and produce the right results in isolation.

TheOC
Contributor

Hey @Noopur_Nigam ,
Makes sense, so that would come under a separate arm of testing.

Thanks for getting back to me!
TheOC

__angel__
New Contributor III

My Tips:

  • Disable Adaptive Query Execution (AQE).
  • Significantly reduce spark.sql.shuffle.partitions (set it between 1 and the total number of available cores).

Since unit tests typically process a small amount of data, using the default values for these settings can negatively impact test performance.

Example: Simply reducing the number of shuffle partitions to 1 cut our unit test execution time from 14 minutes to just 7 minutes.
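
For reference, a minimal sketch of applying these settings to the test SparkSession fixture from the article (values are illustrative and depend on your environment):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("PyTest")
        .config("spark.sql.adaptive.enabled", "false")  # disable AQE for tiny test datasets
        .config("spark.sql.shuffle.partitions", "1")    # avoid the default 200 shuffle partitions
        .getOrCreate()
    )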

rka6
New Contributor

Great article! Using prod-like synthetic data for unit testing will certainly improve application reliability and the quality of your test cases.