Introduction
“When should I test my data? How should it be done? Which tools and testing methods are most effective? Who holds responsibility for data quality?”
These are common questions that arise in discussions around data testing, often sparking meetings and hours of (sometimes conflicting) conversations across teams. No matter which solutions are chosen or who ultimately leads the testing effort, one truth is broadly accepted: to trust your data product, you have to test it, and for that, your product must be explainable, observable, and controllable. Moreover, a strong testing strategy promotes responsible and ethical AI, ensuring that data products are reliable and that development teams remain accountable.
I highly recommend reading the DataOps Manifesto and its principles (https://dataopsmanifesto.org/en/) and sharing it with your team members. Rather than rigid processes and tools, DataOps values individuals and interactions, prioritizes actionable analytics, encourages customer collaboration, and promotes experimentation and cross-functional ownership. By embracing these principles, teams can ensure data quality and operational efficiency throughout the pipeline.
To tackle the complexities of data testing, it’s essential to grasp the fundamentals, define clear roles and responsibilities, and adapt these principles to the unique demands of data pipelines. In this article, I’ll guide you through an overarching testing framework, covering the what, how, and who of data testing. In the second article (here), we’ll delve into the practical tactics and methods needed to implement a robust testing framework. Let’s embark on this journey to explore the world of data testing together.
What is testing?
The concept of testing is often illustrated through Mike Cohn’s test pyramid. There are many versions of this pyramid, but I find the simplest one the most relatable. It emphasizes that foundational tests, those at the base, should make up the majority of testing efforts, focusing on simpler, faster tests. As we move up, tests become more complex, slower, and resource-intensive.
Here’s a breakdown of the testing levels and my attempt to translate them into the data world.
Unit testing
Unit tests are the fastest and most fundamental, designed to cover small, isolated components or functions. Each unit test focuses on an individual action within a component, such as a line of code, a feature in a machine learning model, or a data pipeline segment (e.g., the Bronze, Silver, and Gold layers in a medallion architecture). To apply unit testing effectively, identify each component of your product, anticipate potential points of failure, and test each in isolation.
You may ask how granular to go when choosing the components to test. This is a good question, and the answer depends on your industry and your organization’s testing strategy. Some industries mandate full-coverage testing strategies that test not only every line of code but also every decision in the code (see MC/DC code coverage). These decisions affect testing costs, team resources, and risk tolerance.
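For instance, here’s a minimal unit-test sketch in Python with pytest; the unit under test, normalize_country, is a hypothetical Silver-layer helper used purely for illustration, not a prescribed implementation:

```python
# test_normalize_country.py: a minimal unit-test sketch (pytest).
# normalize_country is a hypothetical Silver-layer helper, shown for illustration.
import pytest

def normalize_country(value: str) -> str:
    """Map free-form country names to ISO alpha-2 codes; reject unknowns."""
    mapping = {"united states": "US", "usa": "US", "germany": "DE"}
    key = value.strip().lower()
    if key not in mapping:
        raise ValueError(f"Unknown country: {value!r}")
    return mapping[key]

@pytest.mark.parametrize("raw, expected", [
    ("USA", "US"),
    ("  Germany ", "DE"),  # stray whitespace is a classic point of failure
])
def test_normalize_country_happy_path(raw, expected):
    assert normalize_country(raw) == expected

def test_normalize_country_rejects_unknown_values():
    with pytest.raises(ValueError):
        normalize_country("Atlantis")
```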
Integration testing
Once individual units are validated, integration testing ensures they work together as expected. In a data pipeline using medallion architecture, this could involve verifying that data moves smoothly from the Bronze (raw data) layer through Silver (transformation) to Gold (curated data). For example, testing the integration between the data extraction (Bronze) and transformation (Silver) steps involves validating data flow and transformation accuracy. Mock data can help validate each stage, ensuring expected outcomes at every level.
In modern data workflows, integration tests are often managed or supported by the data platform itself, which can automate much of the process. In the next article, I’ll explain many aspects of Databricks’ capabilities for testing.
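As a minimal sketch, an integration test for the Bronze-to-Silver hand-off might look like this in pandas; extract_bronze and transform_silver are hypothetical stand-ins for your own pipeline stages:

```python
# test_bronze_to_silver.py: integration-test sketch with mock data (pandas, pytest).
import pandas as pd

def extract_bronze() -> pd.DataFrame:
    # A real pipeline would read raw files or a landing zone; mock data here.
    return pd.DataFrame({"order_id": [1, 2, 2], "amount": ["10.5", "3.0", "3.0"]})

def transform_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    # Silver stage: deduplicate and cast types.
    silver = bronze.drop_duplicates(subset="order_id").copy()
    silver["amount"] = silver["amount"].astype(float)
    return silver

def test_bronze_flows_into_silver():
    silver = transform_silver(extract_bronze())
    # Integration expectations: duplicates removed, types cast, totals intact.
    assert silver["order_id"].is_unique
    assert silver["amount"].dtype == float
    assert silver["amount"].sum() == 13.5
```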
E2E testing
End-to-End testing simulates a complete transaction, verifying the entire workflow from data ingestion to the final product. It combines functional, performance, security, and user experience testing to ensure that end users can interact with the product as intended. For example, in a data analytics setting, an analyst checks whether the data is usable, meaning it is regularly refreshed and accurate. This type of testing typically requires a quality assurance (QA) environment where all components are in place, allowing you to validate the product’s end-to-end behavior from a user’s perspective.
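Here’s a minimal sketch of such a check from the analyst’s point of view; run_query is a hypothetical helper bound to your QA environment, and the table and column names are illustrative:

```python
# e2e_usability_check.py: sketch of an end-to-end usability check.
from datetime import datetime, timedelta, timezone

def check_gold_is_usable(run_query) -> None:
    row = run_query(
        "SELECT MAX(updated_at) AS last_refresh, COUNT(*) AS row_count "
        "FROM gold.daily_sales"
    )
    # Freshness: the table was refreshed within the last 24 hours.
    assert row["last_refresh"] >= datetime.now(timezone.utc) - timedelta(hours=24)
    # Plausibility: the load did not silently produce an empty table.
    assert row["row_count"] > 0
```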
Mock data plays a critical role in integration and E2E testing at each stage of an ETL pipeline, ensuring expected outcomes at every level. Here’s how to make the most of it:
- Diverse Data Types: Include a variety of data types to reflect real-life scenarios, covering all possible values the pipeline may encounter.
- Edge Case Coverage: Ensure mock data includes edge cases, such as extreme values and unusual characters, and is scaled realistically to match production volume.
- Schema Alignment: Match the production schema in the mock data, ensuring the same data types, lengths, and formats. If working with relational data, maintain foreign key relationships to simulate dependencies accurately.
For testing, consider:
- Data Quality Simulation: Introduce common data quality issues — duplicates, missing fields, and inconsistent formatting — to test data validation and error-handling capabilities.
- Data Integrity Testing: Create scenarios where data integrity is intentionally compromised to observe how the pipeline responds to these errors.
- Transformation Logic: Reflect all expected transformations, including aggregations, joins, and calculations, to verify that transformation logic functions as intended. Add some outlier cases that may not transform as expected to test error handling in the ETL pipeline.
- Load Testing: Ensure the data format and structure align with the target system’s requirements. Account for destination constraints, such as primary keys, unique indexes, or storage limits, to test loading functionality comprehensively.
Finally, maintain a consistent mock dataset and consider automating mock data generation. This will allow for quick adjustments if there are changes to the schema or pipeline requirements.
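Here’s a minimal sketch of automated, deterministic mock-data generation using only Python’s standard library; the schema, edge cases, and injection rates are illustrative:

```python
# make_mock_orders.py: schema-aligned mock data with edge cases and duplicates.
import csv
import random

EDGE_AMOUNTS = [0.0, -1.0, 1e12]            # extremes the pipeline may encounter
EDGE_NAMES = ["", "O'Brien", "名前", None]   # empty, quote, non-ASCII, missing

def make_mock_orders(path: str, n: int = 1000, seed: int = 42) -> None:
    random.seed(seed)  # deterministic: regenerate the same dataset on every run
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "customer_name", "amount"])  # match prod schema
        for i in range(n):
            name = random.choice(EDGE_NAMES) if i % 50 == 0 else f"customer_{i}"
            amount = random.choice(EDGE_AMOUNTS) if i % 97 == 0 else round(random.uniform(1, 500), 2)
            copies = 2 if i % 200 == 0 else 1  # inject duplicates to exercise dedup
            for _ in range(copies):
                writer.writerow([i, name, amount])

if __name__ == "__main__":
    make_mock_orders("mock_orders.csv")
```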
Who is responsible for testing?
Testing is a shared responsibility across data teams, and accountability for quality must be distributed among all stakeholders. However, it’s essential to designate clear roles to avoid duplicated effort and ensure comprehensive coverage. Better still, appoint a leading team to set direction and coordinate the others: one that knows each team’s focus, assigns each test to the right team, and prevents both duplicate tests and tests with no owner.
I recommend the following roles for testing end-to-end data products, though they can and should be tailored to each team’s requirements and ways of working. Each of these data roles brings unique expertise to the testing process:
- Data Engineers (DE) focus on the pipeline’s structural integrity and data transformations.
- Data Analysts (DA) verify data quality and usability for reporting and analysis.
- Data Scientists (DS) ensure model inputs and outputs meet expectations, particularly in training and testing data.
- Machine Learning Engineers (MLE) validate feature engineering and monitor model performance in production.
- Business Intelligence (BI) Teams confirm that insights generated align with business needs.
Depending on the product being tested, some of these teams may or may not work together on a test project, but we know that:
- Teams have deep knowledge of their product
- Teams are responsible for the quality of their product
- Teams drive the innovation and development of their products
- Depending on the product type, how frequently each product type is built, and the organization’s hierarchy, not every role needs to take part in the roles-and-responsibilities discussion.
Establishing Ownership through RACI
A RACI (Responsible, Accountable, Consulted, and Informed) matrix is a useful framework to define testing responsibilities. Depending on the product scope and organizational structure, each team’s level of involvement may vary, but collectively, they ensure testing accountability across data quality, transformation logic, and final product usability.
Example: Testing Responsibility Breakdown for data pipeline ingestion
The table below illustrates how you might allocate responsibility; each organization can customize it based on the selected tests, team dynamics, and product complexity.
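An illustrative allocation for ingestion-pipeline tests might look like this (the assignments are examples to adapt, not a prescription):

| Test activity | DE | DA | DS | MLE | BI |
| --- | --- | --- | --- | --- | --- |
| Schema validation on ingested data | R/A | C | I | I | I |
| Transformation logic (Bronze to Silver) | R/A | C | C | I | I |
| Data quality rules (nulls, duplicates, formats) | R | A | C | I | C |
| Model input and feature validation | C | I | R/A | R | I |
| Report and dashboard usability | I | R | I | I | A |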
Process of testing
Effective testing begins with a solid framework, involving two main components: the testing strategy and the testing plan. These should be defined initially and reviewed regularly to ensure alignment with organizational goals and industry best practices.
Define the Test Strategy (What to Test)
The test strategy provides a high-level overview, answering key questions:
- Objectives: What are the primary goals of testing? (e.g., data accuracy, performance)
- Scope: What specific components require testing, and who is the end user? This includes identifying relevant types of testing such as unit, integration, performance, and user experience testing.
- Risk and Cost: Every team must trade off quality control against fast time to market, based on the cost of errors and project complexity. Ask yourself: What can go wrong? What are the associated risks and costs? What are the risks if testing fails or is not executed, and what resources are required?
- Success Criteria: Define the “finish line” for testing. What outcomes signify a successful test process?
Establish the Test Plan (How to Test)
The test plan translates strategy into tactics:
- Methods and Tools: Decide on the platforms, tools, and frameworks for testing.
- Testing Cycles: Determine the number of testing cycles required to meet quality standards.
- Coordination: Use a RACI framework to define roles within the testing process.
- Timeline: Set a timeline for testing activities, prioritizing key tasks to achieve maximum impact efficiently.
Action plan
After the strategy and plan are in place, it’s time to execute:
- Identify Risks: With each team, identify potential points of failure in their area.
- Design Tests: Create specific test cases, starting with fundamental unit tests.
- Prioritize Tests: Adopt a risk-based approach to prioritize tests according to their impact and urgency. Consider creating a complexity-value matrix to map test cases based on their complexity and the value they provide, helping to focus efforts on the most critical areas.
- Assign Responsibility: Clearly define who is responsible, accountable, consulted, and informed for each test type.
- Automate Where Possible: For efficiency, automate tests that can be consistently repeated.
Each step should be revisited as needed, ensuring that testing adapts to the development cycle and evolving product requirements.
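As a sketch of the automation step above, repeatable quality rules can be expressed as a parametrized pytest suite and run on every pipeline change; the rules and the mocked table here are illustrative:

```python
# test_data_quality.py: sketch of repeatable, automatable quality checks (pytest).
import pandas as pd
import pytest

# Rules as data: easy to extend as new checks are agreed in the test plan.
CHECKS = [
    ("no_null_ids", lambda df: df["order_id"].notna().all()),
    ("positive_amounts", lambda df: (df["amount"] > 0).all()),
    ("unique_ids", lambda df: df["order_id"].is_unique),
]

@pytest.fixture
def silver_orders() -> pd.DataFrame:
    # In CI this would load the freshly built Silver table; mocked here.
    return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]})

@pytest.mark.parametrize("name, check", CHECKS, ids=[c[0] for c in CHECKS])
def test_quality_rule(name, check, silver_orders):
    assert check(silver_orders), f"Quality rule failed: {name}"
```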
Conclusion
In today’s data-driven landscape, rigorous testing is not just a quality assurance measure; it’s essential for building trust in data products. From unit tests on individual components to integration and end-to-end evaluations, a well-structured testing process ensures that data products are reliable, accurate, and ready to meet users’ needs. By defining clear roles, establishing a cohesive strategy, and implementing a detailed action plan, organizations can foster accountability and create a robust foundation for data quality.
In the second part of this series, we’ll explore a practical example focused on testing a data ingestion layer, detailing specific tactics and real-world testing strategies. This hands-on approach will provide actionable insights into implementing a test-driven data pipeline, enhancing quality at each stage from data ingestion to final delivery.