Noopur_Nigam
Databricks Employee

Unit testing is a vital part of software development, ensuring that individual components of your code function as intended. However, when working with PySpark in Databricks, many teams find themselves skipping or minimising unit tests due to a range of unique challenges. Spark's distributed nature, the interactive workflow of Databricks notebooks, and the integration with Databricks-specific tools can make traditional testing approaches feel cumbersome or out of place.

Common issues include difficulty creating isolated test environments, challenges in mocking Spark dataframes and Databricks utilities, and uncertainty about how to automate tests within the Databricks ecosystem. As a result, teams may rely heavily on manual testing or integration tests, which are slower, harder to maintain, and more likely to let bugs slip through.

Yet, skipping unit tests comes at a cost: without them, bugs in data transformations or business logic can make their way into production, leading to costly data errors, pipeline failures, or customer impact. Well-designed unit tests catch issues early, accelerate development, and provide confidence during refactoring or scaling.

This blog outlines practical strategies for writing unit tests for PySpark applications in Databricks. We’ll discuss common pitfalls, share actionable tips, and highlight best practices to help you build a robust testing culture, ensuring your Spark code is reliable, maintainable, and production-ready, even within the unique context of Databricks.

Why Unit Testing in Databricks is Different

Databricks provides a managed platform for big data processing and machine learning that leverages Apache Spark. However, testing PySpark code within Databricks comes with unique challenges:

  • Runtime-Specific Libraries: Code often relies on Databricks utilities like dbutils, which are unavailable outside the Databricks environment.
  • Global SparkSession: The SparkSession provided by Databricks is automatically initialised and may not be accessible outside its runtime.
  • Notebook-Based Workflows: Many workflows are written in notebooks, which can complicate modular testing.

Despite these challenges, unit testing remains essential to ensure code reliability and maintainability.

Strategies for Writing Unit Tests in Databricks

1. Refactor Code for Better Testability

To make your PySpark code easier to test:

  • Extract Transformation Logic: Move data processing logic into standalone Python functions or modules.
  • Minimise Direct Dependencies on dbutils: Use dependency injection or mock objects to replace dbutils calls during testing (a short sketch follows the example below).
  • Control Notebook Execution: Wrap main execution logic in if __name__ == "__main__" blocks to prevent it from running during imports.

Example:

# databricks_notebook.py

from pyspark.sql import functions as F

def process_data(df):
    return df.select("name", "birthDate").filter(F.col("birthDate") >= F.lit("2000-01-01"))

if __name__ == "__main__":
    # Main notebook logic
    uc_volume_path = "/Volumes/my_catalog/my_schema/my_volume/my_data"
    dbutils.fs.mkdirs(uc_volume_path)

By isolating the process_data function, you can test it independently without invoking the notebook's runtime-specific code.
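
The second point, minimising direct dependencies on dbutils, can be handled the same way: instead of referencing the global dbutils inside your logic, pass it in as an argument so a test can substitute a mock. A minimal sketch (the module and function names are illustrative):

# utils_io.py (illustrative module name)

def ensure_volume_dir(dbutils, uc_volume_path: str) -> None:
    # dbutils is received as a parameter rather than referenced globally,
    # so unit tests can pass in a mock instead of the real Databricks utility.
    dbutils.fs.mkdirs(uc_volume_path)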

2. Use Pytest with Fixtures

Pytest offers a flexible framework for writing tests. It allows you to use fixtures to set up reusable resources like a SparkSession.

Example:

import pytest
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from databricks_notebook import process_data

@pytest.fixture(scope="session")
def spark_session():
    return SparkSession.builder.master("local[*]").appName("PyTest").getOrCreate()

def test_process_data(spark_session):
    data = [("Alpha", "2000-01-01"), ("Beta", "1980-05-12")]
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("birthDate", StringType(), True)
    ])
    df = spark_session.createDataFrame(data, schema)
    result_df = process_data(df)

    assert result_df.count() == 1

This approach ensures modularity and allows you to reuse the same SparkSession across multiple test cases.
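
To share that fixture across several test files, a common pytest pattern is to place it in a conftest.py at the root of your test folder, where pytest discovers it automatically; yielding the session also lets you stop it once the whole run finishes. A minimal sketch (file layout is up to you):

# conftest.py -- picked up automatically by pytest

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    spark = SparkSession.builder.master("local[*]").appName("PyTest").getOrCreate()
    yield spark
    spark.stop()  # tear down the session after all tests have run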

3. Organise Code Using Databricks Repos

Databricks Repos enable better organisation of your code by storing PySpark functions in .py files within a repository. Using Databricks Asset Bundles (DABs), you can define and deploy resources programmatically for CI/CD workflows with simple YAML configurations. Tests can be written in separate files (e.g., test_functions.py) using frameworks like pytest and executed directly in Databricks Notebooks.
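
A typical layout for such a repository might look like the following (folder and file names are illustrative):

my_repo/
├── databricks.yml            # Databricks Asset Bundle configuration
├── src/
│   └── databricks_notebook.py
├── tests/
│   └── test_functions.py
└── run_tests_notebook.py     # notebook that installs pytest and runs the tests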

Steps to Run Tests in Databricks Notebooks:

Install pytest:

%pip install pytest

Run tests:

import pytest

retcode = pytest.main([".", "-v", "-p", "no:cacheprovider"])

assert retcode == 0, "Some tests failed!"

This setup integrates seamlessly with Databricks Repos, DABs, version control systems, and CI/CD pipelines, enhancing code modularity and reliability.

4. Validate DataFrames Using Equality Checks

Starting from Apache Spark 3.5 (and Databricks Runtime 14.2), built-in methods like assertDataFrameEqual and assertSchemaEqual simplify DataFrame validation.

Example:

from pyspark.testing.utils import assertDataFrameEqual

def test_dataframe_equality(spark_session):
    df1 = spark_session.createDataFrame([("Alpha", 20)], ["name", "age"])
    df2 = spark_session.createDataFrame([("Alpha", 20)], ["name", "age"])

    assertDataFrameEqual(df1, df2)  # Passes if both DataFrames are identical
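
assertSchemaEqual works the same way for schemas alone, which is handy when you care about column names and types but not the data:

from pyspark.testing.utils import assertSchemaEqual

def test_schema_equality(spark_session):
    df1 = spark_session.createDataFrame([("Alpha", 20)], ["name", "age"])
    df2 = spark_session.createDataFrame([("Beta", 35)], ["name", "age"])

    assertSchemaEqual(df1.schema, df2.schema)  # Passes: same column names and types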

These methods are particularly useful for verifying complex transformations or schema changes.

5. Mocking Databricks-Specific Libraries

Databricks utilities like dbutils can be mocked using Python's unittest.mock module to simulate their behaviour during testing.

Example:

from unittest.mock import MagicMock

def test_dbutils_interaction():
    mock_dbutils = MagicMock()
    mock_dbutils.fs.mkdirs.return_value = None

    # Simulate a function call that creates a Unity Catalog volume directory
    uc_volume_path = "/Volumes/my_catalog/my_schema/my_volume/my_data"
    mock_dbutils.fs.mkdirs(uc_volume_path)

    mock_dbutils.fs.mkdirs.assert_called_once_with(uc_volume_path)

This ensures that your tests remain independent of the actual Databricks runtime environment.
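
If you refactored dbutils access behind a function parameter, as in the dependency-injection sketch from section 1, the same mock can be handed straight to your own code (utils_io and ensure_volume_dir are the illustrative names from that sketch):

from unittest.mock import MagicMock
from utils_io import ensure_volume_dir  # illustrative module from section 1

def test_ensure_volume_dir():
    mock_dbutils = MagicMock()
    uc_volume_path = "/Volumes/my_catalog/my_schema/my_volume/my_data"

    ensure_volume_dir(mock_dbutils, uc_volume_path)

    mock_dbutils.fs.mkdirs.assert_called_once_with(uc_volume_path)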

Best Practices for Unit Testing PySpark Code in Databricks

  1. Isolate Business Logic: Keep transformation logic separate from runtime-specific operations like I/O or utility calls.
  2. Use Synthetic Data for Testing: Create small sample datasets within your test cases instead of relying on production data.
  3. Optimise SparkSession Usage: Share a single SparkSession across tests to reduce initialisation overhead.
  4. Integrate Testing into CI/CD Pipelines: Automate testing using tools like GitHub Actions or Azure DevOps.
  5. Test Locally Before Deployment: Validate your code locally before running it in the Databricks environment.

Conclusion

Writing unit tests for PySpark applications in Databricks requires thoughtful structuring of code and careful handling of runtime dependencies. By leveraging tools like pytest, built-in DataFrame comparison methods, and mocking libraries, you can create reliable tests that ensure your data pipelines are robust and error-free. These strategies will help you build scalable applications on the Databricks Platform while minimising bugs and improving code quality.
