This article is part of a series on time series analysis in partnership with the University of Koblenz. See our other articles on data quality and anomaly detection.
By Dhruv Singh Chandel, Ravi Singh, Balram Tiwari, Ankush Arora and Unmesh Mhatre
Time series forecasting has immense potential to transform decision-making across diverse domains. By analyzing historical data, we can uncover trends and patterns that help predict future values. However, effectively harnessing the forecasting power of time series data requires meticulous examination. In our recent research, we undertook an applied exploration of time series forecasting techniques in collaboration with Databricks. Our rigorous benchmarking revealed the most accurate algorithms for different time series datasets, offering data-driven guidance for practitioners.
Time series forecasting enables data-driven decision-making by using past data to anticipate future values. It empowers organizations to boost efficiency, avoid risk, and capitalize on new opportunities. A few examples showcase its far-reaching potential: retailers can forecast product demand, investors can anticipate stock price movements, and engineers can monitor IoT sensor streams for early warning signs.
The significance of time series forecasting spans domains. By understanding historical patterns, organizations gain a competitive edge in navigating the future.
To orient our exploration, we first surveyed established time series forecasting techniques documented in academic literature and industry research. This literature review provided context on the respective strengths and limitations of prevalent algorithms like ARIMA, SARIMA, Prophet, LSTM, and XGBoost. For example, research shows that combining ARIMA and ANN models can capture both linear and nonlinear patterns in time series data [1]. We also learned that LSTM networks are particularly suitable for multivariate forecasting problems with complex temporal dependencies [2].
These insights guided our selection of datasets and algorithms. We aimed to benchmark performance across retail sales, stock prices, and IoT sensor data using LSTM, XGBoost, Prophet, and ARIMA/SARIMA. Our methodology was crafted to offer a diverse perspective on real-world forecasting.
We applied time series forecasting techniques to five distinct datasets:
- Sales Data: Rossmann store sales and Walmart sales
- Stock Data: Apple and Google stock prices
- IoT Sensor Data: readings from an IoT sensor stream
This variety of datasets allowed us to evaluate performance across different domains. We implemented systematic workflows encompassing data cleaning, featurization, normalization, train-test splitting, and model optimization.
Quick, iterative experiments were possible thanks to the cutting-edge infrastructure that Databricks provided. We used RMSE (root mean squared error) to quantify accuracy as we tuned hyperparameters and compared forecasts against ground-truth data.
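For concreteness, here is a minimal sketch of the chronological train-test split and RMSE computation described above; the helper names and the 20% test fraction are illustrative, not our exact pipeline:

```python
import numpy as np
import pandas as pd

def time_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Split a time-ordered frame without shuffling, to avoid leaking the future into training."""
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

def rmse(y_true, y_pred) -> float:
    """Root mean squared error, the metric reported in Table 1."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```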
| Dataset | ARIMA/SARIMA | LSTM | Prophet | XGBoost |
|---|---|---|---|---|
| Apple Stock Data | 17.85 | 3.96 | 23.09 | 14.34 |
| Google Stock Data | 23.442 | 26.740 | 27.148 | 22.417 |
| Rossmann Sales Data | 85.28 | 157.79 | 391.601 | 1354.44 |
| Walmart Sales Data | 210.21 | 278.88 | 312.41 | 327.61 |
| IoT Sensor Data | 1.140 | 0.113 | 0.582 | 2.859 |

Table 1: Performance Comparison (RMSE Values) of Forecasting Algorithms
While no one-size-fits-all solution emerged, these results underscore the importance of selecting algorithms suited to the nuances of each dataset.
Check out the full code and documentation on our GitHub repository to explore this applied time series forecasting project further. We welcome any feedback to help us keep improving our scientific communication. This has been an insightful journey into the future potential of time series forecasting.
This project offered enriching lessons that sharpened our intuition for applied time series analysis.
In stock analysis, the key to accurate predictions lies in meticulous feature extraction and strategic feature selection. The hands-on insights from this exploration enriched our understanding of the field; developing an intuition for the practicalities of time series forecasting is invaluable for practitioners.
Feature Extraction
Utilizing tsfresh, our StockAnalysis class automatically extracts intricate patterns from stock data, transforming raw series into meaningful features. These features encapsulate vital market behaviors, enabling the model to discern nuanced trends beyond raw price points.
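As a hedged illustration (the toy frame, column names "id", "time", "price", and the ticker are ours, not the project's actual schema), tsfresh-based extraction looks roughly like this:

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

# tsfresh expects long format: one row per observation, keyed by a series id.
df = pd.DataFrame({
    "id": ["AAPL"] * 30,  # illustrative ticker, single series
    "time": pd.date_range("2023-01-01", periods=30, freq="D"),
    "price": [100 + i + (i % 5) for i in range(30)],
})

# Extract hundreds of candidate features (statistics, autocorrelations, ...).
features = extract_features(df, column_id="id", column_sort="time")
impute(features)       # replace NaN/inf from features undefined on short series
print(features.shape)  # one row per id, hundreds of candidate columns
```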
Feature Selection
The class employs SelectKBest from scikit-learn to keep the top 10 extracted features. This curation focuses the model on the most informative inputs, reducing noise in the learning process and ultimately leading to more accurate stock predictions.
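A minimal sketch of that selection step, with a synthetic matrix standing in for the tsfresh feature output and the prediction target:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples, 50 candidate features
y = 2.0 * X[:, 0] + rng.normal(size=200)  # target mostly driven by feature 0

# Keep the 10 features with the highest univariate regression scores.
selector = SelectKBest(score_func=f_regression, k=10)
X_top = selector.fit_transform(X, y)

print(X_top.shape)                         # (200, 10)
print(selector.get_support(indices=True))  # indices of the retained features
```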
Through this balance of extraction and selection, the StockAnalysis class captures market dynamics more faithfully, strengthening its forecasting capabilities.
When implementing SARIMA models for time series forecasting, two techniques can significantly enhance performance: ensuring stationarity and identifying optimal model parameters.
Dickey-Fuller Test for Stationarity
The Dickey-Fuller test determines whether a time series is stationary by comparing the test statistic against critical values. If the test statistic is less than the critical value, the null hypothesis of a unit root is rejected and the data is considered stationary. Detecting non-stationarity matters because it can degrade SARIMA modeling; once identified, appropriate transformations such as differencing can restore stationarity and improve model accuracy.
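A minimal sketch of this check with statsmodels, using a random walk as a stand-in for a real series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# A random walk: non-stationary by construction, so the test should not reject.
series = pd.Series(np.random.default_rng(0).normal(size=300)).cumsum()

stat, pvalue, _, _, crit, _ = adfuller(series)
print(f"ADF statistic: {stat:.3f}, p-value: {pvalue:.3f}")
if stat < crit["5%"]:
    print("Stationary at the 5% level")
else:
    print("Non-stationary: difference the series, e.g. series.diff().dropna()")
```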
ACF and PACF for Parameter Selection
Seasonal Autoregressive Integrated Moving Average (SARIMA) is a time series forecasting model that builds on ARIMA by including seasonality. Its key non-seasonal components are:
- p, the autoregressive order: how many lagged observations enter the model
- d, the integrated order: how many times the series is differenced to reach stationarity
- q, the moving average order: how many lagged forecast errors enter the model
Seasonal counterparts (P, D, Q) and the seasonal period s complete the specification.
The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots help visually identify suitable SARIMA parameters p, d, and q. The ACF plot shows the correlation between the time series and its lagged values; the PACF shows that correlation after removing the effects of intervening lags. Analyzing patterns in these plots points to suitable SARIMA parameters. For example, with monthly data, a significant spike at lag 12 in both the ACF and PACF plots typically indicates a yearly cycle, suggesting a seasonal component with period 12.
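A short sketch of producing these diagnostics with statsmodels (the white-noise series and lag count are placeholders for a real, differenced series):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = pd.Series(np.random.default_rng(1).normal(size=300))

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=36, ax=axes[0])   # spikes suggest MA order q and seasonality
plot_pacf(series, lags=36, ax=axes[1])  # spikes suggest AR order p
plt.tight_layout()
plt.show()
```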
Choosing p, d, and q from the ACF and PACF plots gave us SARIMA models that fit the data better: this approach yielded lower forecast errors than arbitrary parameter choices, demonstrating how much simple diagnostic plots can improve SARIMA performance by matching the model to the natural patterns in the data.
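Once the orders are read off the plots, fitting is straightforward with statsmodels' SARIMAX. The synthetic monthly series and the (1, 1, 1)(1, 1, 1, 12) orders below are illustrative, not the ones we used for every dataset:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with trend + yearly seasonality + noise.
idx = pd.date_range("2015-01", periods=120, freq="MS")
t = np.arange(120)
series = pd.Series(
    10 + 0.1 * t + 5 * np.sin(2 * np.pi * t / 12)
    + np.random.default_rng(2).normal(size=120),
    index=idx,
)

# order=(p, d, q), seasonal_order=(P, D, Q, s) with s=12 for monthly data.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit = model.fit(disp=False)
forecast = fit.forecast(steps=12)  # one year ahead
print(forecast.head())
```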
A highlight of this project was our experience leveraging Databricks' state-of-the-art infrastructure. Databricks provided three key advantages:
Streamlined Experimentation: The interactive workspace enabled rapid testing of ideas and seamless collaboration. This accelerated our analysis.
Scalable Computing: Distributed processing capabilities enabled fast hyperparameter tuning at scale. This allowed quick iteration for algorithm optimization.
Managed Infrastructure: Databricks provided a robust cloud-based platform that removed the complexities of setting up and managing infrastructure. Instead of having to configure clusters, servers, networking, storage, and software ourselves, Databricks took care of it behind the scenes.
Databricks' combination of streamlined workflows, scalable processing, and managed infrastructure amplified the impact of our time series forecasting exploration. Our positive experience showcases the immense value robust data and AI platforms offer in accelerating applied research.
This project has illuminated exciting possibilities for future work. There are two high-potential directions to take.
Ensemble Methods: Combining multiple models could yield improved predictive performance compared to individual techniques. This merits rigorous exploration.
Interpretability: Incorporating techniques to explain model forecasts will be pivotal for actionability. Providing practitioners with clear rationales for predictions can catalyze data-driven decision-making.
Our applied forecasting journey has equipped us with intuitions and tools to further expand the frontiers of time series analytics. There remain extensive opportunities to enhance techniques, particularly by pursuing interpretability and ensemble modeling. The insights gained from our comparative analysis provide a springboard to advance the state-of-the-art. We look forward to building on our work to unlock even greater value from time series data across domains. The potential to amplify forecasting capabilities motivates us to continually push boundaries through research and innovation.
We extend our sincere gratitude to Prof. Frank Hopfgartner, our academic advisor at the University of Koblenz, whose guidance and expertise played an important role in shaping the direction of this research. His insightful feedback and commitment to academic excellence have been instrumental in the success of our applied exploration of time series forecasting.
We would also like to express our appreciation to Tania Sennikova, our industry advisor and mentor from Databricks, for her invaluable contributions to this project. Tania's industry knowledge and practical insights greatly improved our understanding of real-world applications and enhanced the quality of our research.
References
[1] Zhang, Guoqiang Peter. "Time series forecasting using a hybrid ARIMA and neural network model." Neurocomputing 50.1-4 (2003): 159–175.
[2] Gers, Felix A., Douglas Eck, and Jürgen Schmidhuber. "Applying LSTM to time series predictable through time-window approaches." International Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg, 2001.