Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.

[Figure: forecasting.png]
This article is part of a series on time series analysis in partnership with the University of Koblenz. See our other articles on data quality and anomaly detection.


By Dhruv Singh Chandel, Ravi Singh, Balram Tiwari, Ankush Arora, and Unmesh Mhatre

Time series forecasting has immense potential to transform decision-making across diverse domains. By analyzing historical data, we can uncover trends and patterns that help predict future values. However, effectively harnessing the forecasting power of time series data requires meticulous examination. In our recent research, we undertook an applied exploration of time series forecasting techniques in collaboration with Databricks. Our rigorous benchmarking revealed the most accurate algorithms for different time series datasets, offering data-driven guidance for practitioners.

The Significance of Time Series Forecasting 

Time series forecasting enables data-driven decision-making by using past data to anticipate future values. It empowers organizations to boost efficiency, avoid risk, and capitalize on new opportunities. Consider a few examples that showcase its far-reaching potential: 

  • Retail: Forecasting product demand based on past sales facilitates smart inventory management. This helps retailers avoid costly overstocks or out-of-stocks.
  • Finance: Forecasting movements in stock prices allows investors to make prudent trading decisions and maximize returns. 
  • Operations: Forecasting server loads can enable cloud providers to intelligently scale infrastructure to meet future demands. 
  • IoT: Forecasting trends in real-time sensor data from equipment can enable predictive maintenance, saving downtime costs. 

The significance of time series forecasting transcends any single domain. By understanding historical patterns, organizations gain a competitive edge to navigate the future. 

Reviewing Existing Forecasting Techniques 

To orient our exploration, we first surveyed established time series forecasting techniques documented in academic literature and industry research. This literature review provided context on the respective strengths and limitations of prevalent algorithms like ARIMA, SARIMA, Prophet, LSTM, and XGBoost.  For example, research shows that combining ARIMA and ANN models can capture both linear and nonlinear patterns in time series data [1]. We also learned that LSTM networks are particularly suitable for multivariate forecasting problems with complex temporal dependencies [2]. 

These insights guided our selection of datasets and algorithms. We aimed to benchmark performance across retail sales, stock prices, and IoT sensor data using LSTM, XGBoost, Prophet, and ARIMA/SARIMA. Our methodology was crafted to offer a diverse perspective on real-world forecasting.

Tackling Multi-Domain Time Series Datasets 

We applied time-series forecasting techniques to five distinct datasets across three domains:

  • Sales Data: Rossmann and Walmart retail sales
  • Stock Data: Apple and Google stock prices
  • IoT Sensor Data: real-time sensor readings from equipment

This variety of datasets allowed us to evaluate performance across different domains. We implemented systematic workflows encompassing data cleaning, featurization, normalization, train-test splitting, and model optimization. 
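The exact preprocessing code lives in our repository; as a minimal sketch of one step, a leakage-free chronological train-test split (with a hypothetical `chronological_split` helper and toy data) might look like:

```python
import numpy as np
import pandas as pd

def chronological_split(df: pd.DataFrame, test_frac: float = 0.2):
    """Split a time-ordered DataFrame without shuffling, so the test
    set is strictly later in time than the training set."""
    split = int(len(df) * (1 - test_frac))
    return df.iloc[:split], df.iloc[split:]

# Toy daily series standing in for one of the datasets
df = pd.DataFrame({"value": np.arange(100, dtype=float)},
                  index=pd.date_range("2023-01-01", periods=100))

train, test = chronological_split(df)

# Normalize with statistics from the training split only,
# to avoid leaking future information into the model
mean, std = train["value"].mean(), train["value"].std()
train_scaled = (train["value"] - mean) / std
test_scaled = (test["value"] - mean) / std
```

Fitting the scaler on the training split alone is the key design choice: with time series, shuffled splits or test-set statistics silently leak the future into the model.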

Quick, iterative experiments were possible thanks to the cutting-edge infrastructure that Databricks provided. We used RMSE to quantify accuracy as we tuned hyperparameters and compared forecasts to ground-truth data. 
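For reference, the RMSE metric we used can be computed directly with NumPy (a minimal sketch, not the repository's exact code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared
    difference between forecasts and ground truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # → ~1.291 (sqrt of 5/3)
```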

Key Results 

| Dataset | ARIMA/SARIMA | LSTM | Prophet | XGBoost |
|---|---|---|---|---|
| Apple Stock Data | 17.85 | 3.96 | 23.09 | 14.34 |
| Google Stock Data | 23.442 | 26.740 | 27.148 | 22.417 |
| Rossmann Sales Data | 85.28 | 157.79 | 391.601 | 1354.44 |
| Walmart Sales Data | 210.21 | 278.88 | 312.41 | 327.61 |
| IoT Sensor Data | 1.140 | 0.113 | 0.582 | 2.859 |

Table 1: Performance Comparison (RMSE Values) of Forecasting Algorithms

  • For financial data, LSTM and XGBoost proved highly effective in modeling complex stock price fluctuations. 
  • In sales forecasting, SARIMA emerged as the top performer, adeptly handling seasonal and cyclical retail patterns.
  • For IoT sensor data, LSTM achieved the lowest error by leveraging deep learning to model temporal relationships. 

While no one-size-fits-all solution existed, these results validated the importance of selecting algorithms suited to handle the nuances of each dataset. 

Check out the full code and documentation on our GitHub repository to explore this applied time series forecasting project further. We welcome any feedback to continue improving our scientific communication skills. This has been an insightful journey into the future potential of time-series forecasting. 

Reflecting on Our Applied Forecasting Journey 

This project offered enriching lessons that sharpened our intuition for applied time series analysis.

  • The intricacies of real-world data necessitate thoughtful preprocessing. Strategies like denoising, imputation, and featurization are key. 
  • Forecasting performance varies across problem domains. Flexible modeling with cross-validation aids in robust algorithm selection. 
  • Hyperparameter tuning meaningfully improves results. Infrastructure like Databricks enables rapid parallel tuning at scale. 
  • Visualization is pivotal for tweaking models to improve fit to complex temporal patterns.

Optimizing Predictions through Feature Extraction and Selection


In the field of stock analysis, the key to accurate predictions lies in meticulous feature extraction and strategic feature selection. The insights from this intensive exploration enriched our comprehension of the field, and developing an intuition for these practicalities is invaluable for practitioners.

Feature Extraction

Utilizing tsfresh, our StockAnalysis class automatically extracts intricate patterns from stock data, transforming raw information into meaningful features. These features encapsulate vital market behaviors, enabling the model to discern nuanced trends beyond mere price points. 
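The project itself uses tsfresh's automated `extract_features`; as a simplified, hand-rolled stand-in (the `extract_basic_features` helper and toy prices below are illustrative, not our actual code), a few of the kinds of features it generates can be sketched as:

```python
import pandas as pd

def extract_basic_features(prices: pd.Series, window: int = 5) -> pd.DataFrame:
    """Hand-rolled stand-in for a few of the many features tsfresh
    extracts automatically: rolling statistics and lagged values."""
    feats = pd.DataFrame(index=prices.index)
    feats["rolling_mean"] = prices.rolling(window).mean()  # local trend level
    feats["rolling_std"] = prices.rolling(window).std()    # local volatility
    feats["return_1d"] = prices.pct_change()               # one-step return
    feats["lag_1"] = prices.shift(1)                       # previous price
    return feats.dropna()  # drop rows where windows/lags are undefined

prices = pd.Series([100, 101, 99, 102, 104, 103, 105, 108],
                   index=pd.date_range("2023-01-01", periods=8))
features = extract_basic_features(prices)
```

tsfresh computes hundreds of such features per series; the point of the sketch is only to show the shape of the transformation from raw prices to a feature matrix.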

Feature Selection

The class employs SelectKBest from scikit-learn, meticulously choosing the top 10 extracted features. This strategic curation ensures the model focuses on the most influential aspects, optimizing its learning process. By selecting and refining features with precision, the model hones its predictive prowess, ultimately leading to more accurate and insightful stock predictions. 
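A minimal sketch of this selection step with scikit-learn's SelectKBest, using synthetic features in place of the real tsfresh output:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # 30 candidate extracted features
# Target depends only on the first three features, plus noise
y = 2.0 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep only the 10 features most associated with the target,
# mirroring the top-10 curation described above
selector = SelectKBest(score_func=f_regression, k=10)
X_top = selector.fit_transform(X, y)
print(X_top.shape)  # (200, 10)
```

With a strong univariate score function like `f_regression`, the three informative features reliably survive the cut while most pure-noise columns are discarded.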

Through this intricate balance of extraction and selection, the StockAnalysis class achieves a nuanced understanding of market dynamics, elevating its forecasting capabilities to new heights. 

Optimizing SARIMA Modeling through Stationarity and Parameter Selection 

When implementing SARIMA models for time series forecasting, two techniques can significantly enhance performance: ensuring stationarity and identifying optimal model parameters. 

Dickey-Fuller Test for Stationarity

The Dickey-Fuller test determines if a time series dataset is stationary by comparing test statistics against critical values. If the test statistic is less than the critical value, the data is considered stationary. This test allows for the detection of non-stationarity, which can negatively affect SARIMA modeling. Identifying stationarity enables appropriate transformations, like differencing, to improve model accuracy. 

ACF and PACF for Parameter Selection 

Seasonal Autoregressive Integrated Moving Average (SARIMA) is a time series forecasting model that builds on ARIMA by incorporating seasonality. Three key orders define its non-seasonal component:

  • p is the order of the Auto-Regressive (AR) component, indicating the number of lagged observations included in the model.
  • d is the degree of differencing, specifying how much differencing is applied to make the time series stationary.
  • q is the order of the Moving Average (MA) component, defining the number of lagged forecast errors included in the model.

The auto-correlation function (ACF) and partial auto-correlation function (PACF) plots help visually identify suitable SARIMA parameters p, d, and q. The ACF plot shows the correlation between the time series and its lagged values. The PACF shows the correlation at each lag after removing the effects of intervening lags. Analyzing patterns in these plots can determine optimal SARIMA parameters. For example, in monthly data, a significant spike at lag 12 in both the ACF and PACF plots may indicate a yearly seasonal cycle, suggesting a seasonal ARIMA component with period s = 12.

[Figure: ACF and PACF plots used for SARIMA parameter selection]

By carefully choosing p, d, and q from the ACF and PACF plots, you can obtain a SARIMA model that fits the time series better and forecasts more accurately. In our experiments, this ACF/PACF approach to parameter selection yielded lower forecast errors than arbitrary selection, showing how useful these plotting tools can be for finding parameters that match the natural patterns in the data and thereby improving the performance of the SARIMA model. 

The Databricks Experience: Infrastructure and Environment 

A highlight of this project was our experience leveraging Databricks' state-of-the-art infrastructure. Databricks provided three key advantages.

Streamlined Experimentation: The interactive workspace enabled rapid testing of ideas and seamless collaboration. This accelerated our analysis.

Scalable Computing: Distributed processing capabilities enabled fast hyperparameter tuning at scale. This allowed quick iteration for algorithm optimization. 

Managed Infrastructure: Databricks provided a robust cloud-based platform that removed the complexities of setting up and managing infrastructure. Instead of having to configure clusters, servers, networking, storage, and software ourselves, Databricks took care of it behind the scenes.

Databricks' combination of streamlined workflows, scalable processing, and managed infrastructure amplified the impact of our time series forecasting exploration. Our positive experience showcases the immense value robust data and AI platforms offer in accelerating applied research. 

Looking Ahead 

This project has illuminated exciting possibilities for future work. There are two high-potential directions to take.

  • Ensemble Methods: Combining multiple models could yield improved predictive performance compared to individual techniques. This merits rigorous exploration.
  • Interpretability: Incorporating techniques to explain model forecasts will be pivotal for actionability. Providing practitioners with clear rationales for predictions can catalyze data-driven decision-making. 

Our applied forecasting journey has equipped us with intuitions and tools to further expand the frontiers of time series analytics. There remain extensive opportunities to enhance techniques, particularly by pursuing interpretability and ensemble modeling. The insights gained from our comparative analysis provide a springboard to advance the state-of-the-art. We look forward to building on our work to unlock even greater value from time series data across domains. The potential to amplify forecasting capabilities motivates us to continually push boundaries through research and innovation. 

Acknowledgements

We extend our sincere gratitude to Prof. Frank Hopfgartner, our academic advisor at the University of Koblenz, whose guidance and expertise played an important role in shaping the direction of this research. His insightful feedback and commitment to academic excellence have been instrumental in the success of our applied exploration of time series forecasting.

We would also like to express our appreciation to Tania Sennikova, our industry advisor and mentor from Databricks, for her invaluable contributions to this project. Tania's industry knowledge and practical insights greatly improved our understanding of real-world applications and enhanced the quality of our research. 

References 

[1] Zhang, Guoqiang Peter. "Time series forecasting using a hybrid ARIMA and neural network model." Neurocomputing 50.1-4 (2003): 159–175. 

[2] Gers, Felix A., Douglas Eck, and Jürgen Schmidhuber. "Applying LSTM to time series predictable through time-window approaches." International Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg, 2001.