This article is part of a series on time series analysis in partnership with the University of Koblenz. See our other articles on forecasting and anomaly detection.
In the age of cloud computing, where data reigns supreme, ensuring data quality has never been more critical. Time series data, with its dynamic nature and diverse applications, demands special attention in cloud environments. In this blog post, we explore a Data Quality Management Framework tailored for time series data in the cloud, applied here to the UCI Air Quality dataset.
Framework Overview
Time series data underpins crucial decisions in today's data-driven landscape. The framework discussed in this post emerged from the need to effectively address data reliability challenges in time series data. Ensuring accuracy, consistency, and completeness requires a specialized toolset. We aim to provide data professionals with a user-friendly solution that not only detects issues but actively enhances data quality. By simplifying the process, we can make high-quality time series data accessible to all, fostering data-driven excellence across industries.
At its core, the data quality framework comprises two essential components: the Checker and the Improver. These components work together to comprehensively assess data quality and enact meaningful improvements.
Checker
The Checker component comprises a wide range of functions that inspect data for inconsistencies, missing values, outliers, and other quality-related issues. With its ability to analyze large datasets swiftly, the Checker ensures that your data meets the highest standards.
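To make the idea concrete, here is a minimal sketch of what a Checker could look like in pandas. The class shape and method names are our illustrative assumptions, not the framework's actual API:

```python
import pandas as pd

class Checker:
    """Illustrative sketch of a Checker; the method names are assumed."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def check_missing_data(self) -> pd.Series:
        # Missing-value count per column
        return self.df.isna().sum()

    def check_duplicates(self) -> int:
        # Number of fully duplicated rows
        return int(self.df.duplicated().sum())

    def check_skewness(self) -> pd.Series:
        # Skewness of each numeric column
        return self.df.select_dtypes("number").skew()

df = pd.DataFrame({"CO(GT)": [2.6, None, 2.2, 2.2], "T": [13.6, 13.3, None, 11.0]})
checker = Checker(df)
print(checker.check_missing_data())
```

Each check returns a plain pandas object, so results can feed directly into a report or dashboard.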
Improver
Where the Checker identifies areas for improvement, the Improver component steps in to elevate your data quality. Unlike conventional solutions that offer only basic fixes, the Improver class provides a set of advanced options. For instance, when addressing missing values, you're not limited to simple imputation methods; you have the flexibility to choose from a range of techniques, including machine learning-based imputation models. This empowers you to tailor data quality improvements to your specific needs. Let's take a closer look at the Improver's capabilities.
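This choice between techniques can be pictured as a simple dispatch. The sketch below is an assumption for illustration, showing only two of the simpler strategies, not the actual Improver implementation:

```python
import pandas as pd

class Improver:
    """Illustrative sketch of an Improver; method and option names are assumed."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def impute(self, column: str, method: str = "linear") -> pd.Series:
        s = self.df[column]
        if method == "ffill_bfill":
            # Carry the last observation forward, then fill leading gaps backward
            return s.ffill().bfill()
        if method == "linear":
            # Straight-line estimate between the neighbors of each gap
            return s.interpolate(method="linear")
        raise ValueError(f"unknown method: {method}")

df = pd.DataFrame({"CO(GT)": [2.0, None, 4.0]})
print(Improver(df).impute("CO(GT)").tolist())  # [2.0, 3.0, 4.0]
```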
These quality scores and insights are presented on a user-friendly dashboard, empowering data professionals to make informed decisions and ensuring data-driven excellence across industries. The Air Quality Data Quality Dashboard provides two key metrics for selected columns of interest: the overall normalized consistency score and the overall normalized relevancy score.
The Air Quality Data Quality Dashboard provides users with a comprehensive overview of important data quality indicators. This includes the number of missing data points, which currently stands at 20,652, corresponding to a data completeness level of 87.17%. The skewness of individual columns like CO(GT), PT08.S1(CO), and NMHC(GT) is also provided, giving insights into the distribution of the data. Moreover, users can determine the stationarity status of these columns, with most found to be stationary. This ensures that the data is reliable for further analytical processes. With this complete snapshot, users can confidently evaluate the quality of their time series data and make informed decisions accordingly.
When put to the test on the UCI Air Quality dataset, the framework yielded the following results. The data had an 87.17% completeness rate before improvement and a 99.93% completeness rate after; the 20,652 missing values identified were filled with imputation methods in the Improver class. Temporal analysis with check_missing_data() revealed days with low coverage, such as 2004-03-11. Statistical analysis showed that the columns CO(GT), PT08.S1(CO), C6H6(GT), PT08.S2(NMHC), PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH, and AH all exhibit negative skewness, indicating a bias toward higher values, while NMHC(GT) and NOx(GT) display positive skewness, indicating a bias toward lower values. Chemical sensor columns such as CO(GT), PT08.S1(CO), and NMHC(GT) were non-stationary, while meteorological characteristics such as temperature (T), relative humidity (RH), and absolute humidity (AH) were stationary.
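The per-day coverage idea behind check_missing_data() can be sketched in plain pandas. The column layout follows the UCI dataset, but the tiny sample and the 50% threshold are illustrative assumptions:

```python
import pandas as pd

# Toy sample in the UCI Air Quality layout: one fully covered day, one empty day
df = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2004-03-10 18:00", "2004-03-10 19:00", "2004-03-11 00:00", "2004-03-11 01:00"]
    ),
    "CO(GT)": [2.6, 2.0, None, None],
})

# Per-day completeness: non-null readings divided by expected readings
daily = df.set_index("Datetime")["CO(GT)"].resample("D")
coverage = daily.count() / daily.size()

# Flag days whose coverage falls below a chosen threshold
low_coverage_days = coverage[coverage < 0.5].index.strftime("%Y-%m-%d").tolist()
print(low_coverage_days)  # ['2004-03-11']
```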
Exploring imputation methods
In total, 20,652 rows of our dataset were identified as containing missing values. A sample of the data used as input to the Impute method can be seen in Figure 4 below:
We used three different imputation methods to fill in gaps and improve consistency.
Forward-backward fill
In the next figure, we see the missing values filled using forward_backward_fill. The method carries the last known value forward and, where gaps remain, fills backward from the next known value.
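Assuming forward_backward_fill builds on pandas' fill primitives, a minimal equivalent looks like this:

```python
import pandas as pd

s = pd.Series([None, 2.6, None, None, 2.2])
# Forward fill first, then backward fill to cover any leading gap
filled = s.ffill().bfill()
print(filled.tolist())  # [2.6, 2.6, 2.6, 2.6, 2.2]
```

Note the leading gap: forward fill alone cannot reach it, which is why the backward pass exists.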
Linear interpolation
The impute_linear_interpolation function copies the records from the selected column and then applies linear interpolation: it fits a straight line between the known points on either side of a gap and estimates each missing value from its position along that line. You can see the results of linear interpolation in Figure 6:
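In pandas terms (assuming the function builds on Series.interpolate), the estimate looks like this:

```python
import pandas as pd

s = pd.Series([2.0, None, None, 5.0])
# Each missing value is placed on the straight line between its known neighbors
print(s.interpolate(method="linear").tolist())  # [2.0, 3.0, 4.0, 5.0]
```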
Seasonal decomposition
The impute_seasonal_decomposition function first preprocesses the dataset, converting 'Date' and 'Time' strings to datetime objects and merging them into one 'Datetime' column. It then performs seasonal decomposition on the selected column, isolating seasonal patterns from the data. Missing values are imputed with the seasonal component, keeping them aligned with the data's seasonality. Figure 7 shows the results of seasonal decomposition:
Frequency alignment
An hourly frequency alignment of CO(GT) allows for consistent temporal analysis. The align_frequencies method aligns values with target frequencies, making it easier to perform comparisons over a period of time. Noise in the measurements was reduced using outlier smoothing: the smooth_outlier method selects target columns and applies a moving average with a specified window size, mitigating sudden fluctuations in the data. Results of this method are shown in Figure 8:
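Assuming align_frequencies and smooth_outlier build on pandas' resampling and rolling windows, the two steps can be sketched as follows (the sample values and window size are illustrative):

```python
import pandas as pd

idx = pd.to_datetime([
    "2004-03-10 18:00", "2004-03-10 18:30",
    "2004-03-10 19:00", "2004-03-10 21:00",
])
s = pd.Series([2.6, 2.4, 2.0, 1.8], index=idx)

# Align to an hourly grid; half-hour readings are averaged, gaps become NaN
hourly = s.resample("h").mean()

# Smooth fluctuations with a centered moving average
smoothed = hourly.rolling(window=3, min_periods=1, center=True).mean()
print(hourly.tolist())  # [2.5, 2.0, nan, 1.8]
```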
The framework also identified outliers by calculating z-scores, treating any data point that exceeds a predetermined threshold as an outlier. To facilitate further analysis, the method introduced two new columns: 'is_outlier', which flags data points as outliers, and 'smoothened_rows', which records the names of columns with outliers for each row. The duplicates discovered were dropped, and differencing enhanced stationarity for columns like CO(GT) and PT08.S1(CO). The framework's ability to detect and rectify data issues makes it a valuable asset for any organization relying on time series data for decision-making.
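A compact sketch of the z-score flag and the differencing step (the threshold of 2.0 and the sample values are illustrative assumptions):

```python
import pandas as pd

s = pd.Series([2.0, 2.1, 1.9, 2.0, 9.0, 2.1])

# Flag points whose z-score magnitude exceeds a chosen threshold
z = (s - s.mean()) / s.std()
is_outlier = z.abs() > 2.0
print(is_outlier.tolist())  # [False, False, False, False, True, False]

# First-order differencing, as used to improve stationarity of columns like CO(GT)
differenced = s.diff()
```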
Future Work
The framework has a wide range of potential extensions. One important option is broadening compatibility with data sources: while the framework currently supports data upload from zip files, it could be expanded to interface directly with various database systems. More time series-specific validation techniques, such as change-point detection, could be added to the existing toolkit. Robustness could be increased by supporting a variety of time indexes, such as a DatetimeIndex with a range of frequencies. Automated recommendations based on evaluation results could also be included. Testing the framework on additional, varied datasets would further demonstrate its usefulness. These are intriguing directions for further research.
Our experience with Databricks
Databricks played a crucial role in effectively conducting our research project. Our team collaborated through notebooks for interactive code development and data analysis, allowing us to efficiently track work progress, test multiple approaches, and share expertise in data quality improvement. Databricks Repos kept our code synced to the Git repository and facilitated change tracking and code versioning. With Databricks, we could effectively analyze large datasets without resource constraints.
Acknowledgement
The blog post was authored by Kavya Sasikumar with contributions from Aravind RK and Deepthy Paulose. We would like to express our sincere gratitude to Prof. Dr. Frank Hopfgartner for his invaluable guidance, support, and mentorship throughout the course of this project. His expertise and insights were instrumental in shaping this report. We would also like to extend our heartfelt thanks to Evgeny Chernyi from Databricks for his collaboration and assistance in providing valuable data and insights. His contributions significantly enriched the quality of this research. Lastly, we would like to acknowledge the Universität Koblenz for providing the resources and the environment necessary for conducting this research.
Resources
Github link for the source code
UCI Air quality dataset