This article is part of a series on time series analysis in partnership with the University of Koblenz. See our other articles on forecasting and anomaly detection.
In the age of cloud computing, where data reigns supreme, ensuring data quality has never been more critical. Time series data, with its dynamic nature and diverse applications, demands special attention in cloud environments. In this blog post, we explore a Data Quality Management Framework tailored for time series data in the cloud, applied here to the UCI Air Quality dataset.
Framework Overview
Time series data underpins crucial decisions in today's data-driven landscape. The framework discussed in this post emerged from the need to effectively address data reliability challenges in time series data. Ensuring accuracy, consistency, and completeness requires a specialized toolset. We aim to provide data professionals with a user-friendly solution that not only detects issues but actively enhances data quality. By simplifying the process, we can make high-quality time series data accessible to all, fostering data-driven excellence across industries.
At its core, the data quality framework comprises two essential components: the Checker and the Improver. These components work together to comprehensively assess data quality and enact meaningful improvements.
Checker
The Checker component comprises a wide range of functions that inspect data for inconsistencies, missing values, outliers, and other quality-related issues. With its ability to analyze large datasets swiftly, the Checker ensures that your data meets the highest standards.
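To make the idea concrete, here is a minimal sketch of what a Checker could look like in pandas. The class shape and method names are our illustrative assumptions, not the framework's actual API:

```python
import pandas as pd

class Checker:
    """Illustrative sketch of a Checker; the method names are assumed."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def check_missing_data(self) -> pd.Series:
        # Missing-value count per column
        return self.df.isna().sum()

    def check_duplicates(self) -> int:
        # Number of fully duplicated rows
        return int(self.df.duplicated().sum())

    def check_skewness(self) -> pd.Series:
        # Skewness of each numeric column
        return self.df.select_dtypes("number").skew()

df = pd.DataFrame({"CO(GT)": [2.6, None, 2.2, 2.2], "T": [13.6, 13.3, None, 11.0]})
checker = Checker(df)
print(checker.check_missing_data())
```

Each check returns a plain pandas object, so results can feed directly into a report or dashboard.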
Improver
Where the Checker identifies areas for improvement, the Improver component steps in to elevate your data quality. Unlike conventional solutions that offer only basic fixes, the Improver class provides a set of advanced options. For instance, when addressing missing values, you're not limited to simple imputation methods; you have the flexibility to choose from a range of techniques, including machine learning-based imputation models. This empowers you to tailor data quality improvements to your specific needs. Let's take a closer look at the Improver's capabilities.
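This choice between techniques can be pictured as a simple dispatch. The sketch below is an assumption for illustration, showing only two of the simpler strategies, not the actual Improver implementation:

```python
import pandas as pd

class Improver:
    """Illustrative sketch of an Improver; method and option names are assumed."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def impute(self, column: str, method: str = "linear") -> pd.Series:
        s = self.df[column]
        if method == "ffill_bfill":
            # Carry the last observation forward, then fill leading gaps backward
            return s.ffill().bfill()
        if method == "linear":
            # Straight-line estimate between the neighbors of each gap
            return s.interpolate(method="linear")
        raise ValueError(f"unknown method: {method}")

df = pd.DataFrame({"CO(GT)": [2.0, None, 4.0]})
print(Improver(df).impute("CO(GT)").tolist())  # [2.0, 3.0, 4.0]
```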
These quality scores and insights are presented on a user-friendly dashboard, empowering data professionals to make informed decisions and ensuring data-driven excellence across industries. The Air Quality Data Quality Dashboard provides two key metrics for selected columns of interest: the overall normalized consistency score and the overall normalized relevancy score.
The Air Quality Data Quality Dashboard provides users with a comprehensive overview of important data quality indicators. This includes the number of missing data points, which currently stands at 20,652, corresponding to a data completeness level of 87.17%. The skewness of individual columns like CO(GT), PT08.S1(CO), and NMHC(GT) is also provided, giving insights into the distribution of the data. Moreover, users can determine the stationarity status of these columns, with most found to be stationary. This ensures that the data is reliable for further analytical processes. With this complete snapshot, users can confidently evaluate the quality of their time series data and make informed decisions accordingly.
When put to the test on the UCI Air Quality dataset, the framework yielded the following results. The data had an 87.17% completeness rate before improvement and a 99.93% completeness rate after; the 20,652 missing values identified were filled with imputation methods in the Improver class. Temporal analysis with check_missing_data() revealed days with low coverage, such as 2004-03-11. Statistical analysis showed that the columns CO(GT), PT08.S1(CO), C6H6(GT), PT08.S2(NMHC), PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH, and AH all exhibit negative skewness, indicating a bias toward higher values, while NMHC(GT) and NOx(GT) display positive skewness, indicating a bias toward lower values. Chemical sensor columns such as CO(GT), PT08.S1(CO), and NMHC(GT) were non-stationary, while meteorological characteristics such as temperature (T), relative humidity (RH), and absolute humidity (AH) were stationary.
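The per-day coverage idea behind check_missing_data() can be sketched in plain pandas. The column layout follows the UCI dataset, but the tiny sample and the 50% threshold are illustrative assumptions:

```python
import pandas as pd

# Toy sample in the UCI Air Quality layout: one fully covered day, one empty day
df = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2004-03-10 18:00", "2004-03-10 19:00", "2004-03-11 00:00", "2004-03-11 01:00"]
    ),
    "CO(GT)": [2.6, 2.0, None, None],
})

# Per-day completeness: non-null readings divided by expected readings
daily = df.set_index("Datetime")["CO(GT)"].resample("D")
coverage = daily.count() / daily.size()

# Flag days whose coverage falls below a chosen threshold
low_coverage_days = coverage[coverage < 0.5].index.strftime("%Y-%m-%d").tolist()
print(low_coverage_days)  # ['2004-03-11']
```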
Exploring imputation methods
In total, 20,652 rows of our dataset were identified as containing missing values. A sample of the data used as input to the Impute method can be seen in Figure 4 below:
We used three different imputation methods to fill in gaps and improve consistency.
Forward-backward fill
In the next figure, we see the missing values filled using forward_backward_fill. The method carries the last known value forward and, where gaps remain, fills backward from the next known value.
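Assuming forward_backward_fill builds on pandas' fill primitives, a minimal equivalent looks like this:

```python
import pandas as pd

s = pd.Series([None, 2.6, None, None, 2.2])
# Forward fill first, then backward fill to cover any leading gap
filled = s.ffill().bfill()
print(filled.tolist())  # [2.6, 2.6, 2.6, 2.6, 2.2]
```

Note the leading gap: forward fill alone cannot reach it, which is why the backward pass exists.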
Linear interpolation
The impute_linear_interpolation function copies the records from the selected column and then applies linear interpolation: it fits a straight line between the known points on either side of a gap and estimates each missing value from its position along that line. You can see the results of linear interpolation in Figure 6:
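In pandas terms (assuming the function builds on Series.interpolate), the estimate looks like this:

```python
import pandas as pd

s = pd.Series([2.0, None, None, 5.0])
# Each missing value is placed on the straight line between its known neighbors
print(s.interpolate(method="linear").tolist())  # [2.0, 3.0, 4.0, 5.0]
```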
Seasonal decomposition
The impute_seasonal_decomposition function first preprocesses the dataset, converting 'Date' and 'Time' strings to datetime objects and merging them into one 'Datetime' column. It then performs seasonal decomposition on the selected column, isolating seasonal patterns from the data. Missing values are imputed with the seasonal component, keeping them aligned with the data's seasonality. Figure 7 shows the results of seasonal decomposition:
Frequency alignment
An hourly frequency alignment of CO(GT) allows for consistent temporal analysis. The align_frequencies method aligns values with target frequencies, making it easier to perform comparisons over a period of time. Noise in the measurements was reduced using outlier smoothing: the smooth_outlier method selects target columns and applies a moving average with a specified window size, mitigating sudden fluctuations in the data. Results of this method are shown in Figure 8:
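Assuming align_frequencies and smooth_outlier build on pandas' resampling and rolling windows, the two steps can be sketched as follows (the sample values and window size are illustrative):

```python
import pandas as pd

idx = pd.to_datetime([
    "2004-03-10 18:00", "2004-03-10 18:30",
    "2004-03-10 19:00", "2004-03-10 21:00",
])
s = pd.Series([2.6, 2.4, 2.0, 1.8], index=idx)

# Align to an hourly grid; half-hour readings are averaged, gaps become NaN
hourly = s.resample("h").mean()

# Smooth fluctuations with a centered moving average
smoothed = hourly.rolling(window=3, min_periods=1, center=True).mean()
print(hourly.tolist())  # [2.5, 2.0, nan, 1.8]
```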
The framework also identified outliers by calculating z-scores, treating any data point that exceeds a predetermined threshold as an outlier. To facilitate further analysis, the method introduced two new columns: 'is_outlier', which flags data points as outliers, and 'smoothened_rows', which records the names of columns with outliers for each row. The duplicates discovered were dropped, and differencing enhanced stationarity for columns like CO(GT) and PT08.S1(CO). The framework's ability to detect and rectify data issues makes it a valuable asset for any organization relying on time series data for decision-making.
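A compact sketch of the z-score flag and the differencing step (the threshold of 2.0 and the sample values are illustrative assumptions):

```python
import pandas as pd

s = pd.Series([2.0, 2.1, 1.9, 2.0, 9.0, 2.1])

# Flag points whose z-score magnitude exceeds a chosen threshold
z = (s - s.mean()) / s.std()
is_outlier = z.abs() > 2.0
print(is_outlier.tolist())  # [False, False, False, False, True, False]

# First-order differencing, as used to improve stationarity of columns like CO(GT)
differenced = s.diff()
```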
Future Work
The framework has a wide range of potential extensions. One important option is broadening compatibility with data sources: while the framework currently supports data upload from zip files, it could be expanded to interface directly with various database systems. More time series-specific validation techniques, such as change-point detection, could be added to the existing toolkit. Robustness could be increased by supporting a variety of time indexes, such as a DatetimeIndex with a range of frequencies. Automated recommendations based on evaluation results could also be included. Testing the framework on additional, varied datasets would further demonstrate its usefulness. These are intriguing directions for further research.
Our experience with Databricks
Databricks played a crucial role in effectively conducting our research project. Our team collaborated through notebooks for interactive code development and data analysis, allowing us to efficiently track work progress, test multiple approaches, and share expertise in data quality improvement. Databricks Repos kept our code synced to the Git repository and facilitated change tracking and code versioning. With Databricks, we could effectively analyze large datasets without resource constraints.
Acknowledgement
The blog post was authored by Kavya Sasikumar with contributions from Aravind RK and Deepthy Paulose. We would like to express our sincere gratitude to Prof. Dr. Frank Hopfgartner for his invaluable guidance, support, and mentorship throughout the course of this project. His expertise and insights were instrumental in shaping this report. We would also like to extend our heartfelt thanks to Evgeny Chernyi from Databricks for his collaboration and assistance in providing valuable data and insights. His contributions significantly enriched the quality of this research. Lastly, we would like to acknowledge the Universität Koblenz for providing the resources and the environment necessary for conducting this research.
Resources
Github link for the source code
UCI Air quality dataset