
The need for anomaly detection across industries


This article is part of a series on time series analysis in partnership with the University of Koblenz. See our other articles on data quality and forecasting.

We live in a digital age where data is constantly generated in chronological order: patient medical records, stock market prices, readings from smart devices. Anomaly detection, an important application of artificial intelligence, identifies irregularities in such time series data. The volume and complexity of data are rising at such a rate that standard methods of analysis can no longer detect hidden patterns or irregularities. The dynamic nature of time series data, which consists of observations collected over an extended period of time, makes it particularly difficult to work with. Anomalies in time series data can take the form of abrupt spikes, extended aberrations, or subtle but recurring patterns that point to a problem or potential danger.

Anomalies in patient medical records, for example, may indicate the onset of a serious illness or the occurrence of a medical error. Anomalies in stock market data may indicate fraud or market manipulation in the finance industry. Anomalies in production data can disturb manufacturing operations, potentially resulting in equipment failures or faults. The digital world requires an effective anomaly detection system that can quickly spot these anomalies. The key purpose of this research is to compare various existing strategies for detecting anomalies in time series data and to identify the most effective techniques for univariate and multivariate datasets.

Existing research


Anomaly detection for time series data has been the subject of several research articles, spanning statistical methods to deep learning models. Prominent papers include:

Anomaly detection in time series: a comprehensive evaluation:
This paper discusses the importance of anomaly detection and the different algorithms that can be used. By evaluating 71 anomaly detection methods on 976 distinct univariate and multivariate time series, Schmidl et al. (2022) made a substantial contribution.

Their work attempted to assist researchers in choosing the optimal algorithm based on particular demands and aims, in addition to highlighting the benefits and drawbacks of various approaches. They conclude that there is no one-size-fits-all algorithm for anomaly detection, and the best algorithm for a particular task will depend on the specific characteristics of the data.

Multi-head CNN–RNN for multi-time series anomaly detection: An industrial case study

Multi-head CNN-RNN is a deep learning anomaly detection technique for both univariate and multivariate time series data. The authors carried out an extensive investigation of this technique, examining 48 datasets to compare the performance of approaches such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), and autoencoders, offering insights into the applicability of these methods in different contexts. Their multi-head architecture for multi-time series anomaly detection successfully found anomalies in industrial applications, particularly for heterogeneous sensor systems, and outperformed classic anomaly detection approaches, notably in dealing with point, context-specific, and collective anomalies.

A Comparative Analysis of Traditional and Deep Learning-Based Anomaly Detection Methods for Streaming Data:

Munir et al. (2019) compared deep learning techniques with conventional machine learning techniques, including K-Nearest Neighbors, Local Outlier Factor, and Principal Component Analysis. Deep learning methods outperformed the other anomaly detection strategies in terms of accuracy, achieving the best average F1-score of up to 0.90.

Together, these studies illustrate the wide range of approaches available for anomaly detection and underscore the need for an approach tailored to the distinctive characteristics of the data and the types of anomalies present.

Our research

We conducted research on both univariate and multivariate datasets with the goal of finding the best-performing anomaly detection methods. To assess the various approaches thoroughly, we used a variety of time series datasets, including datasets generated synthetically with GutenTag, such as ECG, Cylinder-Bell-Funnel, Sine-Mean, and Sine-Platform, as well as publicly accessible datasets such as Secure Water Treatment (SWaT).

The datasets were meticulously pre-processed to guarantee the accuracy and dependability of our study. Data cleaning involved removing null values, while data transformation comprised resolving missing values, converting categorical variables into numerical representations, and adjusting data types. Feature engineering techniques, including feature scaling through standardization, further prepared the data for analysis, as sketched below.
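To make that pipeline concrete, here is a minimal preprocessing sketch in pandas and scikit-learn. The file name and the column names ("timestamp", "status", and the sensor features) are hypothetical placeholders, not the actual schema of our datasets.

```python
# Minimal preprocessing sketch; file and column names are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("timeseries.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Data cleaning: drop rows where every feature is null.
df = df.dropna(how="all")

# Data transformation: encode a categorical column numerically
# and cast the remaining feature columns to float.
df["status"] = df["status"].astype("category").cat.codes
feature_cols = [c for c in df.columns if c != "status"]
df[feature_cols] = df[feature_cols].astype(float)

# Feature scaling: standardize to zero mean and unit variance, so
# scale-sensitive methods such as KNN weight all attributes equally.
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])
```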

The selected anomaly detection methods comprised a range of approaches (a toy sketch of the first two follows the list):

KNN (K-Nearest Neighbors): This classic machine learning technique measures how similar data points are by calculating their distance from one another. It is especially helpful when anomalous points lie far from the dense regions formed by typical data points.

Isolation Forest: This tree-based anomaly detection method separates anomalies by randomly partitioning data points into isolation trees. It is effective at finding pronounced anomalies in a dataset.

DeepAnT (Deep Anomaly Detection): Uses deep learning to find anomalies in complex datasets, especially time series data. DeepAnT is designed for situations in which anomalies may exhibit intricate patterns.

LSTM Autoencoder (Long Short-Term Memory Autoencoder): This deep learning technique uses LSTM networks to reconstruct the input data; points with high reconstruction error are flagged as anomalies. It captures temporal dependencies in time series data well.

LoOP (Local Outlier Probability): Uses the local density of nearby data points to calculate the probability that a given data point is an anomaly. LoOP works well in situations where anomalies show locally confined patterns.
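As a toy illustration of the first two methods, the sketch below scores sliding windows of a synthetic sine series with a KNN-distance score and an Isolation Forest. The window size, k, contamination rate, and injected spike are illustrative choices, not our tuned parameters.

```python
# Toy KNN and Isolation Forest anomaly scoring on a synthetic series.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * rng.normal(size=2000)
series[700:705] += 3.0  # inject an artificial spike

# Turn the series into overlapping windows so temporal context is preserved.
window = 20
X = np.lib.stride_tricks.sliding_window_view(series, window)

# KNN-style score: mean distance to the k nearest windows.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
knn_score = dist[:, 1:].mean(axis=1)  # column 0 is the self-distance

# Isolation Forest: lower decision scores mean "more anomalous", so negate.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
iso_score = -iso.decision_function(X)

print("most anomalous window (KNN):", knn_score.argmax())
print("most anomalous window (IF): ", iso_score.argmax())
```

Both scores peak on windows overlapping the injected spike, which is the behavior the list above describes.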

 

Figure 1: F1 score vs. AUC-ROC score for the SWaT dataset

Our main evaluation metrics were the F1 score and the AUC-ROC score. The F1 score combines precision and recall, providing a fair assessment when classes are imbalanced, as anomaly labels usually are. The AUC-ROC score evaluates an algorithm's ability to discriminate between normal and anomalous data points.
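For reference, both metrics are available in scikit-learn. The labels and scores below are toy values, not our experimental results.

```python
# Computing the two evaluation metrics with scikit-learn on toy values.
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]                 # 1 = anomaly
scores = [0.1, 0.2, 0.1, 0.9, 0.3, 0.7, 0.2, 0.1, 0.8, 0.4]

# AUC-ROC works on raw anomaly scores and is threshold-independent.
print("AUC-ROC:", roc_auc_score(y_true, scores))

# F1 needs hard predictions, so a threshold must be chosen first.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print("F1:", f1_score(y_true, y_pred))
```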

 

Figure 2: F1 score vs. AUC-ROC score for the ECG, Sine-Mean, Sine-Platform, and CBF datasets

These metrics offered a thorough picture of how well each anomaly detection method performed across settings and datasets. Our results demonstrated the superiority of deep learning algorithms in time series anomaly detection, especially DeepAnT and LSTM-AE, whose outcomes were noticeably better on both the univariate and the multivariate datasets. It is crucial to remember, however, that no single algorithm works in all situations; the best approach depends on the specifics of the data and the kinds of anomalies present.

What we learned

The purpose of our research, 'Anomaly Detection on Time Series Data', was to evaluate different algorithms for anomaly detection on time series data. It entailed testing different algorithms on different datasets, fine-tuning parameters, and evaluating results with appropriate metrics. This work gave us hands-on experience with many aspects of the field, such as data pre-processing, algorithm development, and evaluation methods, and a better understanding of the difficulties and complexities it presents.

The importance of data preprocessing emerged as one of the key takeaways of this project. Time series data typically represents a continual flow of observations, and missing values disturb this continuity, making effective analysis and modeling difficult. In applications where time series data drives decisions and actions, complete and reliable data is essential, and filling in missing values preserves the data's temporal relationships (a small sketch follows). When features have significantly different scales, scale-sensitive algorithms like K-Nearest Neighbors (KNN) can give higher weight to particular attributes, producing poor outcomes. We resolved this issue through standardization, which lets the algorithm give each attribute the same weight.
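As a small illustration of time-aware gap filling (one reasonable option among several), pandas can interpolate along a datetime index so that filled values respect the spacing of neighboring observations:

```python
# Time-aware interpolation of missing values; data here is a toy series.
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=8, freq="H")
s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, np.nan, 8.0], index=idx)

# Interpolating along the time axis keeps the temporal relationship
# between neighboring observations intact.
filled = s.interpolate(method="time")
print(filled)
```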

Another important takeaway is that anomalies are not just spikes in data; they can be continuous drift, cyclic patterns, or dips. This underscores the importance of understanding the various kinds of anomalies and how they appear in the data in order to choose the right algorithms, fine-tune parameters, and set appropriate thresholds.

Having domain knowledge is crucial for deciding what constitutes an anomaly. With the wide range of approaches available for identifying anomalies in time series data, it is vital to choose the right algorithm and evaluation metrics for the nature of the data and the type of anomaly present.

Our experience with Databricks

We used Databricks for our research, and it played a huge role in conducting and completing our project effectively. It significantly accelerated the project by providing an environment that made collaboration between team members easier. Databricks' collaborative workspace provides notebooks for interactive code development and data analysis, so team members can keep track of work, justify their approaches to problem-solving, and share their expertise in feature engineering and model selection in one place. Our research also involved working efficiently with large datasets; the scalable environment Databricks offers for data processing let us analyze massive volumes of data without running out of resources.

Databricks can distribute data and computational processes across an Apache Spark cluster when more resources are required to handle increasing processing needs. Its auto-scaling and distributed computing features increase or decrease the computational resources allocated based on the demands of the project, as in the sketch below.
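As a rough sketch, a cluster spec sent to the Databricks Clusters API can request autoscaling between a minimum and a maximum worker count. The node type and runtime version below are illustrative placeholders, not our actual configuration.

```python
# Illustrative cluster spec with autoscaling, e.g. for POSTing to the
# Databricks Clusters API (/api/2.0/clusters/create); values are placeholders.
cluster_spec = {
    "cluster_name": "anomaly-detection-research",
    "spark_version": "13.3.x-cpu-ml-scala2.12",  # an ML Runtime version
    "node_type_id": "i3.xlarge",                 # cloud-specific node type
    "autoscale": {
        "min_workers": 2,   # baseline for light workloads
        "max_workers": 8,   # ceiling during heavy training runs
    },
}
```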

Importing and handling the large datasets used in our models was a challenging task, but Databricks has various tools and capabilities that made data import simpler and more efficient. We used the Databricks File System (DBFS) to store and manage our data files. You can upload data to DBFS through the Databricks workspace UI or programmatically from Databricks notebooks, via the Databricks REST API, or with the Databricks CLI.
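For example, inside a Databricks notebook (where `spark` and `dbutils` are predefined), uploaded files can be listed and read back as shown below. The DBFS path and file name are hypothetical examples, not our actual dataset location.

```python
# Listing and reading DBFS files from a Databricks notebook; the path
# "dbfs:/FileStore/anomaly_data/" is a hypothetical example.
files = dbutils.fs.ls("dbfs:/FileStore/anomaly_data/")
print(files)

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/FileStore/anomaly_data/swat.csv"))
df.show(5)
```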

Databricks also provides a Runtime that functions as a built-in toolkit, with libraries such as TensorFlow, XGBoost, and scikit-learn preinstalled. This simplified building and training complex machine learning models such as our LSTM-AE.
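A compact sketch of an LSTM autoencoder in Keras (bundled with the ML Runtime) is shown below. The layer sizes, window length, training data, and 3-sigma threshold are illustrative, not the exact configuration we used.

```python
# Minimal LSTM autoencoder sketch for reconstruction-based anomaly detection.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 1
model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(32, return_sequences=False),           # encode the window
    layers.RepeatVector(timesteps),                    # expand back in time
    layers.LSTM(32, return_sequences=True),            # decode step by step
    layers.TimeDistributed(layers.Dense(n_features)),  # reconstruct inputs
])
model.compile(optimizer="adam", loss="mse")

# Train on (mostly) normal windows; a high reconstruction error at test
# time then flags a window as anomalous.
X_train = np.random.normal(size=(256, timesteps, n_features)).astype("float32")
model.fit(X_train, X_train, epochs=3, batch_size=32, verbose=0)

errors = np.mean((model.predict(X_train, verbose=0) - X_train) ** 2, axis=(1, 2))
threshold = errors.mean() + 3 * errors.std()  # simple 3-sigma rule
print("anomalous windows:", np.where(errors > threshold)[0])
```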

In summary, Databricks combines tools for data preparation, model building, training, assessment, and deployment on a single platform. This streamlines the machine learning workflow from beginning to end, making it a flexible platform for teams working in data science and machine learning.

 

Contributors:

Aishwarya Ashok Bodkhe

Avishek Pathania

Prajna Shetty

Sanchita Singh

Ritik Gupta

 

Acknowledgments:

We extend our heartfelt gratitude to our academic adviser, Prof. Hopfgartner, whose guidance and expertise have been instrumental in shaping our research. His unwavering support has been a beacon of inspiration throughout this project.

Furthermore, we express our appreciation to our industry adviser, Mr. Alan Mazankiewicz, whose practical insights and real-world experience have added invaluable depth to our work. He served as our mentor at Databricks, providing us with valuable industry perspectives.

Institutional Support:

We would also like to acknowledge the support of the Universität Koblenz for fostering an environment that encourages collaborative research and learning. Our affiliation with Uni Koblenz has played a crucial role in providing us with the resources and opportunities necessary for this endeavor.

As we share our discoveries and experiences, we hope this article inspires others in their data science journeys.

Thank you for being a part of our exploration!