Databricks Community

yoni-r · ‎10-23-2024

Using TimesFM on Databricks with covariate support

For industries that rely on accurate predictions to inform decision-making, time series forecasting has long been an essential tool. From predicting energy consumption to forecasting retail demand, it is important to be able to predict future values based on historical data. Traditional time series models such as exponential smoothing and ARIMA, while effective, come with performance and accuracy limitations when processing large-scale data or complex patterns.

Generative AI, particularly Time Series Transformers (TSTs), has come to the fore as a technology which can handle these challenges. Similar to the transformer models used in natural language processing (NLP), TSTs can capture complex, non-linear dependencies over long sequences. This makes them adaptable to real-world data with missing values, seasonality, and irregular patterns. TSTs use a self-attention mechanism to analyse time-series data and capture seasonality patterns. These models are pre-trained on huge datasets to create foundation models, which can then be fine-tuned to adapt to specific time series data.

This blog, building upon An Introduction to Time Series Forecasting with Generative AI, zooms in specifically on TimesFM, a pre-trained time series foundation model from Google Research. We will explore how you can leverage this model with Databricks, focusing on its support for covariates and its limitations.

Introduction to TimesFM

TimesFM, short for Time Series Foundation Model, is an open-source time series forecasting model developed by Google Research. Built on a transformer-based architecture, TimesFM can handle different forecasting tasks, ranging from short-term predictions to long-range forecasts. While models like Chronos treat time series data similarly to natural language models, TimesFM is also designed around characteristics specific to time series data, such as seasonality, missing values, and multivariate dependencies.

As it was pre-trained on over 100 billion real-world time series points, TimesFM is effective in generalising to new datasets. In many cases, it enables accurate zero-shot predictions without the need for retraining on the new datasets. The extensive pre-training also gives TimesFM the ability to recognise both short- and long-term dependencies in time series data. This makes TimesFM useful across a variety of applications where there is a need to capture seasonal patterns and trends.

The TimesFM architecture is shown in Figure 1. While the details of this architecture are beyond our scope here, we encourage you to refer to the research paper.

Figure 1: TimesFM architecture (from https://research.google/blog/a-decoder-only-foundation-model-for-time-series-forecasting/)

Another useful aspect of TimesFM is the support for external covariates, since time series do not occur in isolation. Numerous variables and external factors, such as economic indicators or weather conditions, can correlate with a time series, and including them in the analysis can enhance prediction accuracy. We will look at support for external covariates in a later section. Next, let’s focus on the use of TimesFM with Databricks.

Using TimesFM with Databricks

Databricks provides a scalable and powerful cloud environment for training and using machine learning models. By combining TimesFM's forecasting capabilities with Databricks' infrastructure, including support for GPUs, we can handle large-scale time series data, apply TimesFM to our specific dataset, and efficiently manage end-to-end machine learning workflows. In addition, with Unity Catalog, Databricks provides governance for both data and models.

Getting Started with TimesFM on Databricks

We will give an overview of the main steps to set up and use TimesFM with Databricks here. The code example provided in An Introduction to Time Series Forecasting with Generative AI provides further details. We will also expand on these steps in the next section on covariates support.

1. Environment Setup

The first step in using TimesFM on Databricks is to configure your environment by installing the required libraries. We will need a Databricks cluster to run the notebook, install the library, and generate forecasts with the model. We can create a cluster or use a predefined Serverless cluster, as we will see in the upcoming section on covariates support.

2. Data Preparation

Before we can use the model, we need to load the time series data. If we want to work with pandas, we load it into a pandas DataFrame; note that we can also make forecasts on a NumPy ndarray.

3. Model Initialization and Forecasting

After setting up the Databricks environment with TimesFM, we can load the latest checkpoint of the pre-trained model. The model can then be used with the time series data. The forecast_on_df() function is used to generate forecasts.
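As a minimal sketch of these two steps, the snippet below reshapes a dataframe into the long format that forecast_on_df() works with (a unique_id column identifying the series and a ds timestamp column) and wraps the forecasting call. The helper names to_timesfm_frame and forecast_frame, the series id and the sample values are our own illustrative assumptions; tfm stands for an already loaded TimesFM model.

```python
import pandas as pd


def to_timesfm_frame(df, series_id, time_col, value_col):
    """Return a long-format frame with unique_id and ds columns (hypothetical helper)."""
    out = df[[time_col, value_col]].rename(columns={time_col: "ds"})
    out.insert(0, "unique_id", series_id)
    return out


def forecast_frame(tfm, frame, value_col, freq="h"):
    """Forecast every series in frame; tfm is assumed to be a loaded timesfm.TimesFm model."""
    return tfm.forecast_on_df(inputs=frame, freq=freq, value_name=value_col, num_jobs=-1)


# Tiny illustrative frame with made-up values (not the dataset used in this blog)
raw = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=4, freq="h"),
    "pollution": [129.0, 148.0, 159.0, 181.0],
})
frame = to_timesfm_frame(raw, "station_1", "date", "pollution")
print(frame.columns.tolist())  # ['unique_id', 'ds', 'pollution']
```

With a model loaded, forecast_frame(tfm, frame, "pollution") would return a dataframe of forecasts for each unique_id.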

4. Incorporating Covariates

As mentioned earlier, one of the benefits of using TimesFM is the ability to include covariates, that is, other factors that can influence the forecast. With Databricks, you can ingest multiple large covariate datasets, joining them to the time series data being forecasted, and providing them as covariate inputs to TimesFM. We will discuss this further in the upcoming section on covariates.
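The join described above can be sketched with pandas as follows. The table contents and column names are made up for illustration. The left join keeps every target timestamp, so any gaps in the covariate show up as missing values that need to be handled before forecasting.

```python
import pandas as pd

# Target series and one covariate table (illustrative values only)
target = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]),
    "pollution": [129.0, 148.0, 159.0],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "temp": [-4.0, -4.5],
})

# Left join on the timestamp; validate guards against duplicated timestamps
joined = target.merge(weather, on="date", how="left", validate="one_to_one")

# Count target rows that have no matching covariate value
missing_temp = int(joined["temp"].isna().sum())
print(missing_temp)  # 1
```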

5. Model Deployment

An additional benefit of using Databricks is its end-to-end MLOps support, including for model deployment. With Databricks’ MLflow integration, you can track experiment outcomes, manage different model versions, and deploy the model into production. The deployed model can be used for batch and real-time forecasting. Moreover, the model performance can be monitored for drift detection and continuous forecasting improvements using Databricks Lakehouse Monitoring.

As highlighted in this section, Databricks provides an end-to-end environment for leveraging TimesFM to handle time series forecasting at scale. Next, we will explore the use of covariates support with TimesFM.

Covariates Support in TimesFM

TimesFM's covariates support is one of its standout features. Covariates include factors such as economic indicators like interest rates in financial models, weather data in retail demand forecasting, or other external factors that affect the primary time series.

TimesFM can handle both univariate and multivariate time series forecasting with covariates. This allows it to capture the correlation between the target time series and these external variables. The covariates are input as parallel sequences, which enables the model to learn how they correlate with future values over time. This enhances the model’s ability to adapt to the real world, where external factors often play a key role in outcomes. With well-chosen covariates, this can make TimesFM more accurate than traditional time series models and other time series foundation models, such as Chronos, that do not consider such variables.

In practice, covariates are added in TimesFM via the input data structure which includes both the main time series and the covariates in parallel. In this section, we will dive into the details of how to do this on Databricks.

Covariates Support Steps

Here's a consolidated guide on how to add covariates support to TimesFM on Databricks, with code extracts and detailed steps.

1. Set Up the Environment

In this section, we will install the necessary libraries to run TimesFM. The following location is used to get the TimesFM package:

https://github.com/google-research/timesfm.git

Note that this setup will be simplified with the upcoming support for PyPI packages.

For the execution environment, we can use the predefined Databricks Serverless cluster with CPU support. We can also create our own Databricks cluster configuration with CPU or GPU, starting with a single node and scaling out depending on the workload’s requirements. When creating a cluster configuration, ensure that the Databricks Machine Learning Runtime (MLR) version is compatible with the latest TimesFM version or the one you are using. At the time of writing, we used version 14.3 LTS ML.

Once the compute is selected, we run the following commands in a notebook to install the libraries.

# Install supporting libraries
%pip install jax[cuda12]==0.4.26 --quiet
%pip install protobuf==3.20.* --quiet
%pip install utilsforecast --quiet
%pip install torch --quiet

# Restart python kernel to use updated libraries
dbutils.library.restartPython()

# Install TimesFM library
import sys
import subprocess
package = "git+https://github.com/google-research/timesfm.git"
subprocess.check_call([sys.executable, "-m", "pip", "install", package, "--quiet"])

# Import timesfm library
import timesfm

2. Load the Pre-Trained TimesFM Model

Once the environment is set up, we will load a pre-trained TimesFM model.

# Load TimesFM pretrained checkpoints with hyperparameters
tfm = timesfm.TimesFm(
  hparams = timesfm.TimesFmHparams(
    context_len=512, # max 512, can be shorter
    horizon_len=128,
    input_patch_len=32,
    output_patch_len=128,
    num_layers=20,
    model_dims=1280,
    backend="cpu", # "gpu" when using GPU backend for fine-tuning
  ),
  checkpoint = timesfm.TimesFmCheckpoint(
    huggingface_repo_id="google/timesfm-1.0-200m-pytorch"
  )
)

Note that TimesFM on Databricks supports both CPU and GPU. We have used CPU here as this gives us sufficient performance for the small size of the dataset we are using.
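If the same notebook may run on both CPU and GPU clusters, the backend value can be chosen at runtime instead of being hard-coded. The following is a small sketch under the assumption that PyTorch is used to detect a CUDA device; pick_backend is a hypothetical helper of our own, not part of TimesFM.

```python
def pick_backend():
    """Return "gpu" when PyTorch can see a CUDA device, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "gpu" if torch.cuda.is_available() else "cpu"


backend = pick_backend()
print(backend)
```

The result can then be passed as the backend argument when creating the hyperparameters.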

3. Prepare the Dataset

Before getting to forecasting, we need to prepare our time series data. This can include cleaning the dataset, ensuring that the data is free of missing values, and checking that the frequency of the dataset is consistent. As these details are beyond our scope for this blog, we will be using a dataset with minimal required transformations.
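For readers who do need to check their own data, the two conditions mentioned above can be verified with a few lines of pandas. This is a minimal sketch; check_series and the sample values are our own illustrative assumptions. A series with a consistent frequency has exactly one distinct step size.

```python
import pandas as pd


def check_series(df, time_col, value_cols):
    """Report missing values, duplicate timestamps and irregular time steps."""
    ordered = df.sort_values(time_col)
    steps = ordered[time_col].diff().dropna()  # gaps between consecutive timestamps
    return {
        "missing_values": int(ordered[value_cols].isna().sum().sum()),
        "duplicate_timestamps": int(ordered[time_col].duplicated().sum()),
        "distinct_step_sizes": int(steps.nunique()),
    }


# Made-up sample with one missing value and one skipped hour
sample = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"]),
    "pollution": [129.0, None, 159.0],
})
report = check_series(sample, "date", ["pollution"])
print(report)
```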

In this example, we are using a dataset for air pollution, reading from a CSV file as a source. The only transformation we need to do is change the date column to the correct data type.

import pandas as pd

df = pd.read_csv('/Volumes/forecast/airpollution/dataset/airpollution.csv')
df['date'] = pd.to_datetime(df['date'])  # parse the date column as datetime

The following shows a sample from the dataset.

Figure 2: Dataset sample.

In this example, we want to forecast the pollution level. We assume that we can forecast temperature more reliably than pollution, so we use temperature as a covariate when forecasting pollution.

We will be using the following parameters to prepare the data for use with the model.

batch_size: int = 128   # number of windows per batch
context_len: int = 120  # past time steps given to the model
horizon_len: int = 24   # future time steps to forecast

This means storing the data in the correct structure as follows.

examples["inputs"].append(sub_df["pollution"][start:(context_end := start + context_len)].tolist())
examples["temp"].append(sub_df["temp"][start:context_end + horizon_len].tolist())

Note that we need the dynamic covariate to cover both the forecasting context and the horizon (context_end + horizon_len).
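To make the structure concrete, here is a self-contained sketch of how such windows could be built and grouped into batches. The function name make_batches, the extra "outputs" key holding the actual horizon values for evaluation, and the choice to step windows by horizon_len are our own assumptions for illustration. Static covariates, which take one value per window, are left out for brevity.

```python
from collections import defaultdict


def make_batches(series, target, covariates, context_len=120, horizon_len=24, batch_size=128):
    """Yield batches of context windows with covariates spanning context and horizon.

    series maps column names to equal-length sequences; for a pandas
    DataFrame, df.to_dict("list") gives this shape.
    """
    n = len(series[target])
    window = context_len + horizon_len
    batch = defaultdict(list)
    # Assumption: step by horizon_len so that forecast horizons do not overlap
    for start in range(0, n - window + 1, horizon_len):
        context_end = start + context_len
        batch["inputs"].append(list(series[target][start:context_end]))
        batch["outputs"].append(list(series[target][context_end:context_end + horizon_len]))
        for name in covariates:
            # Dynamic covariates must cover the context and the horizon
            batch[name].append(list(series[name][start:context_end + horizon_len]))
        if len(batch["inputs"]) == batch_size:
            yield dict(batch)
            batch = defaultdict(list)
    if batch["inputs"]:
        yield dict(batch)  # final, possibly smaller batch


# Small made-up series so the windows are easy to inspect
toy = {"pollution": list(range(10)), "temp": list(range(100, 110))}
batches = list(make_batches(toy, "pollution", ["temp"], context_len=4, horizon_len=2, batch_size=2))
print(batches[0]["inputs"][0])  # [0, 1, 2, 3]
print(batches[0]["temp"][0])    # [100, 101, 102, 103, 104, 105]
```

Each yielded dictionary plays the role of one batch of prepared data, with 4 context values per input and 6 values per dynamic covariate in this toy setup.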

4. Forecasting

The following code example shows the use of the forecast_with_covariates() function with our air pollution dataset:

# tfm is the model loaded in step 2; example is one batch of prepared data
cov_forecast, ols_forecast = tfm.forecast_with_covariates(
    inputs=example["inputs"],
    dynamic_numerical_covariates={
        "temp": example["temp"],
    },
    dynamic_categorical_covariates={},
    static_numerical_covariates={},
    static_categorical_covariates={
        "wnd_dir": example["wnd_dir"]
    },
    freq=[0] * len(example["inputs"]),
    xreg_mode="xreg + timesfm",              # default
    ridge=0.0,
    force_on_cpu=False,
    normalize_xreg_target_per_input=True,    # default
)

You can find a complete code example of covariate support with another dataset here.
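To build some intuition for the two return values, the sketch below illustrates our understanding of the idea behind the "xreg + timesfm" mode: a linear regression of the target on the covariates is fitted on the context, the residuals are forecast by the base model, and the two parts are added together. This is a conceptual illustration with a single covariate and not TimesFM's implementation. In particular, naive_mean is a deliberately simple stand-in for the TimesFM forecast, and xreg_plus_base is a hypothetical name.

```python
import numpy as np


def xreg_plus_base(context_y, covariate, horizon_len, base_forecast):
    """Regress the target on the covariate, forecast the residuals, then recombine.

    covariate must span context and horizon; base_forecast stands in for TimesFM.
    Returns (combined forecast, regression-only forecast).
    """
    y = np.asarray(context_y, dtype=float)
    x = np.asarray(covariate, dtype=float)
    context_len = len(y)
    if len(x) != context_len + horizon_len:
        raise ValueError("covariate must cover context_len + horizon_len steps")
    # Design matrix with an intercept column; ordinary least squares (no ridge penalty)
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design[:context_len], y, rcond=None)
    fitted = design @ coef
    residuals = y - fitted[:context_len]
    regression_forecast = fitted[context_len:]
    return regression_forecast + base_forecast(residuals, horizon_len), regression_forecast


def naive_mean(history, horizon_len):
    """Placeholder forecaster: repeat the mean of the history."""
    return np.full(horizon_len, history.mean())


temp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # made-up covariate: 4 context + 2 horizon steps
pollution = [1.0, 3.0, 5.0, 7.0]       # made-up target, exactly 2 * temp + 1
combined, regression_only = xreg_plus_base(pollution, temp, 2, naive_mean)
print(np.round(combined, 6))
```

In this toy case the covariate explains the target completely, so the residuals are zero and both forecasts coincide. With a weakly related covariate the regression part adds little signal, which is worth keeping in mind when interpreting the results.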

Single Covariate

After the forecasting step in the previous section, we measured the Mean Absolute Error (MAE) to evaluate the model. We got the following values:

eval_mae_timesfm: 57.511073775267526
eval_mae_xreg_timesfm: 64.58154015704602

The univariate forecasting yielded a better result (eval_mae_timesfm) than the forecasting with a single covariate (eval_mae_xreg_timesfm).
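For reference, MAE is the average absolute difference between the actual and forecast values. The following sketch shows one way to compute it over all windows and horizon steps. The numbers are made up for illustration and are unrelated to the results above, and the exact aggregation used for the reported values may differ.

```python
import numpy as np


def mean_absolute_error(actuals, forecasts):
    """Average absolute difference across all windows and horizon steps."""
    actuals = np.asarray(actuals, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    if actuals.shape != forecasts.shape:
        raise ValueError("actuals and forecasts must have the same shape")
    return float(np.mean(np.abs(actuals - forecasts)))


# Two windows with a horizon of two steps each (illustrative values)
actual = [[10.0, 12.0], [20.0, 18.0]]
baseline = [[11.0, 12.0], [17.0, 18.0]]
with_covariates = [[10.0, 13.0], [20.0, 20.0]]

mae_baseline = mean_absolute_error(actual, baseline)
mae_covariates = mean_absolute_error(actual, with_covariates)
print(mae_baseline, mae_covariates)  # 1.0 0.75
```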

To investigate why this could be the case, let us go back and check our assumptions on the covariate. One quick way to do so is to calculate the correlation matrix, as follows.

pdf_for_corr = df[["pollution", "temp", "wnd_spd", "dew", "press", "snow", "rain"]]
pdf_for_corr.corr()

The outcome, as we can see below, shows a low correlation between pollution and temperature.

Figure 3: Correlation matrix.
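When there are many candidate covariates, it can be easier to read a ranked list than a full matrix. The sketch below orders the numeric columns by the absolute value of their Pearson correlation with the target. The helper rank_covariates and the toy values are our own assumptions. Note that this only captures linear, same-time-step relationships, so it is a quick screening aid and not a guarantee of forecasting benefit.

```python
import pandas as pd


def rank_covariates(df, target):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up values: dew tracks pollution closely, temp does not
toy = pd.DataFrame({
    "pollution": [1.0, 2.0, 3.0, 4.0],
    "temp": [1.0, -1.0, 1.0, -1.0],
    "dew": [2.0, 4.0, 6.0, 8.0],
})
ranked = rank_covariates(toy, "pollution")
print(ranked.index.tolist())  # ['dew', 'temp']
```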

While TimesFM provides a useful feature with covariate support, the wrong choice of covariate can produce worse results. The fact that a single covariate did not improve the result here does not mean this will be the case with your dataset. We recommend experimenting with a simpler single-covariate model first before adding more covariates, which we will do next.

Multiple Covariates

As TimesFM supports the use of multiple covariates, we can investigate the impact of using additional covariates with the following change to the code:

dynamic_numerical_covariates={
    "temp": example["temp"],
    "wnd_spd": example["wnd_spd"],
    "dew": example["dew"],
    "press": example["press"],
},

This does result in an improved MAE:

eval_mae_timesfm: 57.511073775267526
eval_mae_xreg_timesfm: 56.83941069676225

A practical consideration when using multiple covariates is the availability of reliable covariate values, or of forecasting models for the covariates, over the forecast horizon. The important point here is that we should compare the results with no covariates, a single covariate, and multiple covariates, and experiment to ensure that we are actually improving the forecasting model.

Note that the use of TimesFM described in this blog is based on the checkpointed version available at the time of writing. As this is subject to change in the future, always refer to the latest accompanying documentation.

Conclusion

When used in the right way, Generative AI models like TimesFM offer powerful forecasting capabilities across multiple industries, from financial forecasting to supply chain management. With the ability to predict trends and behaviours with greater precision, these models enable businesses to make better decisions and operate more efficiently.

As we have seen in this blog, the integration of Generative AI with Databricks enhances capabilities even further. By providing the infrastructure to automate and scale time series model training and forecasting, Databricks allows businesses to leverage models such as TimesFM for their specific datasets and applications. Businesses can take full advantage of Generative AI for time series forecasting models on Databricks.

Enhance your Time Series Analysis capabilities with Generative AI on Databricks.