“When should I test my data? How should it be done? Which tools and testing methods are most effective? Who holds responsibility for data quality?”
These are common questions that arise in discussions around data testing, often sparking meetings and hours of (sometimes conflicting) conversations across teams. No matter which solutions are chosen or who ultimately leads the testing effort, one truth is widely accepted: to trust your data product, you have to test it, and for that, your product must be explainable, observable, and controllable. Moreover, a strong testing strategy promotes responsible and ethical AI, ensuring that data products are reliable and that development teams remain accountable.
I highly recommend reading the DataOps Manifesto and its principles (https://dataopsmanifesto.org/en/) and sharing it with your team members. Rather than rigid processes and tools, DataOps values individuals and interactions, prioritizes actionable analytics, encourages customer collaboration, and promotes experimentation and cross-functional ownership. By embracing these principles, teams can ensure data quality and operational efficiency throughout the pipeline.
To tackle the complexities of data testing, it’s essential to grasp the fundamentals, define clear roles and responsibilities, and adapt these principles to the unique demands of data pipelines. In this article, I’ll guide you through an overarching testing framework, covering the what, how, and who of data testing. In the second article (here), we’ll delve into the practical tactics and methods needed to implement a robust testing framework. Let’s embark on this journey to explore the world of data testing together.
The concept of testing is often illustrated through Mike Cohn’s test pyramid. There are many versions of this pyramid, but I find the simplest one the most relatable. The pyramid emphasizes that foundational tests — those at the base — should make up the majority of testing efforts, focusing on simpler, faster tests. As we move up, tests become more complex, slower, and more resource-intensive.
Here’s a breakdown of the testing levels and my attempt to translate them into the data world.
Unit tests are the fastest and most fundamental, designed to cover small, isolated components or functions. Each unit test focuses on an individual action within a component, such as a line of code, a feature in a machine learning model, or a data pipeline segment (e.g., the Bronze, Silver, and Gold layers in a medallion architecture). To apply unit testing effectively, identify each component of your product, anticipate potential points of failure, and test each in isolation.
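For instance, a minimal unit-test sketch (using pytest and pandas, with a hypothetical Silver-layer helper called clean_orders that is not part of any real library) might isolate a single transformation like this:

```python
# test_clean_orders.py -- a minimal unit-test sketch (pytest + pandas).
# `clean_orders` is a hypothetical Silver-layer helper used only for illustration.
import pandas as pd


def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Trim whitespace from customer names and drop rows missing an order_id."""
    out = df.copy()
    out["customer"] = out["customer"].str.strip()
    return out.dropna(subset=["order_id"])


def test_clean_orders_drops_null_ids_and_trims_names():
    raw = pd.DataFrame(
        {"order_id": [1, None, 3], "customer": [" Ada ", "Bob", "Cleo "]}
    )
    cleaned = clean_orders(raw)

    # Only the rows with a valid order_id survive.
    assert list(cleaned["order_id"]) == [1, 3]
    # Names are trimmed; nothing else is altered.
    assert list(cleaned["customer"]) == ["Ada", "Cleo"]
```

The point is the isolation: the test exercises one function against a tiny, fully controlled input, so a failure points directly at that component.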
You may ask how granular you should go when choosing the components to test. That is a really good question, and the answer depends on your industry and your organization’s testing strategy. Some industries mandate full-coverage testing strategies that test not only every line of code but also every single decision in the code (see MC/DC code coverage). These choices affect testing costs, team resources, and risk tolerance levels.
Once individual units are validated, integration testing ensures they work together as expected. In a data pipeline using medallion architecture, this could involve verifying that data moves smoothly from the Bronze (raw data) layer through Silver (transformation) to Gold (curated data). For example, testing the integration between the data extraction (Bronze) and transformation (Silver) steps involves validating data flow and transformation accuracy. Mock data can help validate each stage, ensuring expected outcomes at every level.
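As a rough sketch of such an integration test, assuming two hypothetical stage functions, extract_bronze and transform_to_silver, the test below chains them over a small mock dataset and checks that a malformed record does not leak downstream:

```python
# test_bronze_to_silver.py -- integration-test sketch for two adjacent pipeline
# stages. The stage functions below are hypothetical stand-ins for your own code.
import pandas as pd


def extract_bronze(raw_records: list[dict]) -> pd.DataFrame:
    """Bronze: land raw records as-is."""
    return pd.DataFrame(raw_records)


def transform_to_silver(bronze: pd.DataFrame) -> pd.DataFrame:
    """Silver: cast amounts to numeric and keep only valid rows."""
    silver = bronze.copy()
    silver["amount"] = pd.to_numeric(silver["amount"], errors="coerce")
    return silver.dropna(subset=["amount"])


def test_bronze_to_silver_integration():
    # Mock source data, including one malformed record the Silver step should drop.
    mock_source = [
        {"order_id": 1, "amount": "10.50"},
        {"order_id": 2, "amount": "not-a-number"},
    ]

    silver = transform_to_silver(extract_bronze(mock_source))

    # The stages agree on schema, and the bad record does not reach Silver.
    assert list(silver.columns) == ["order_id", "amount"]
    assert len(silver) == 1
    assert silver.iloc[0]["amount"] == 10.5
```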
In modern data workflows, integration tests are often managed or supported by the data platform itself, which can automate much of this process. In the next issue, I’ll explain many aspects of Databricks’ capabilities for testing.
End-to-end testing simulates a complete transaction, verifying the entire workflow from data ingestion to the final product. It combines functional, performance, security, and user experience testing to ensure that end users can interact with the product as intended. For example, in a data analytics setting, an analyst checks whether the data is usable, meaning it is regularly refreshed and accurate. This type of testing typically requires a quality assurance (QA) environment where all components are in place, allowing you to validate the product’s end-to-end performance from a user’s perspective.
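A hedged sketch of the kind of end-to-end check an analyst might care about, with a made-up Gold table layout, column names, and a 24-hour freshness threshold, could look like this:

```python
# e2e_checks.py -- sketch of simple end-to-end checks on a curated (Gold) table.
# The table layout, column names, and freshness threshold are assumptions for
# illustration; swap in your own query, columns, and thresholds.
from datetime import datetime, timedelta, timezone

import pandas as pd


def check_gold_table(gold: pd.DataFrame,
                     max_staleness: timedelta = timedelta(hours=24)) -> None:
    """Fail loudly if the curated table is empty, stale, or contains impossible values."""
    assert not gold.empty, "Gold table returned no rows"

    latest_load = gold["loaded_at"].max()
    assert datetime.now(timezone.utc) - latest_load <= max_staleness, (
        f"Data is stale: last load at {latest_load}"
    )

    # A basic accuracy guard: revenue should never be negative.
    assert (gold["revenue"] >= 0).all(), "Found negative revenue values"


if __name__ == "__main__":
    # Stand-in for a query against the QA environment's Gold table.
    sample = pd.DataFrame(
        {
            "revenue": [120.0, 75.5],
            "loaded_at": [datetime.now(timezone.utc)] * 2,
        }
    )
    check_gold_table(sample)
    print("E2E checks passed")
```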
Mock data plays a critical role in integration and E2E testing at each stage of an ETL pipeline, helping ensure expected outcomes at every level. Maintain a consistent mock dataset across stages, and consider automating mock data generation, as sketched below, so that tests can adjust quickly when the schema or pipeline requirements change.
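One way to automate this is to drive the generator from a single schema definition, so a schema change only needs to be made in one place. The column names and generator functions below are hypothetical examples:

```python
# mock_data.py -- sketch of schema-driven mock data generation so that mock
# datasets stay consistent when the pipeline schema changes. The schema below
# is a made-up example; replace it with your own.
import random
from datetime import datetime, timedelta, timezone

import pandas as pd

# One place to describe the expected schema: column name -> generator function.
ORDER_SCHEMA = {
    "order_id": lambda i: i,
    "customer": lambda i: f"customer_{i % 5}",
    "amount": lambda i: round(random.uniform(1, 500), 2),
    "created_at": lambda i: datetime.now(timezone.utc) - timedelta(days=i),
}


def generate_mock_orders(n_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Build a reproducible mock dataset that matches ORDER_SCHEMA."""
    random.seed(seed)  # keep runs deterministic so tests are repeatable
    rows = [{col: gen(i) for col, gen in ORDER_SCHEMA.items()} for i in range(n_rows)]
    return pd.DataFrame(rows)


if __name__ == "__main__":
    mock = generate_mock_orders(10)
    print(mock.head())
```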
Testing is a shared responsibility across data teams: accountability for quality must be distributed among all stakeholders. However, it’s essential to designate clear roles to avoid duplicated effort and ensure comprehensive coverage. Better still, appoint a leading team to oversee direction and guide how the teams work together; a team that knows each team’s focus can assign each test to the right owner and prevent duplicate tests and no-owner situations.
I recommend defining dedicated roles for testing end-to-end data products, though they can and should be tailored to each team’s requirements and ways of working. Each data role brings unique expertise to the testing process.
Depending on the product under test, some of these teams may or may not work together on a given test project.
A RACI (Responsible, Accountable, Consulted, and Informed) matrix is a useful framework to define testing responsibilities. Depending on the product scope and organizational structure, each team’s level of involvement may vary, but collectively, they ensure testing accountability across data quality, transformation logic, and final product usability.
Example: Testing Responsibility Breakdown for data pipeline ingestion
This table illustrates how you can allocate responsibility. Each organization can customize this based on selected tests, team dynamics and product complexity.
Effective testing begins with a solid framework, involving two main components: the testing strategy and the testing plan. These should be defined initially and reviewed regularly to ensure alignment with organizational goals and industry best practices.
The test strategy provides a high-level overview, answering the key questions of what to test, how, and who is responsible.
The test plan translates the strategy into concrete tactics.
After the strategy and plan are in place, it’s time to execute.
Each step should be revisited as needed, ensuring that testing adapts to the development cycle and evolving product requirements.
In today’s data-driven landscape, rigorous testing is not just a quality assurance measure; it’s essential for building trust in data products. From unit tests on individual components to integration and end-to-end evaluations, a well-structured testing process ensures that data products are reliable, accurate, and ready to meet users’ needs. By defining clear roles, establishing a cohesive strategy, and implementing a detailed action plan, organizations can foster accountability and create a robust foundation for data quality.
In the second part of this series, we’ll explore a practical example focused on testing a data ingestion layer, detailing specific tactics and real-world testing strategies. This hands-on approach will provide actionable insights into implementing a test-driven data pipeline, enhancing quality at each stage from data ingestion to final delivery.