BenMackenzie
Databricks Employee


This is the first of three articles about using the Databricks Feature Store. This first article focuses on using existing features to create your dataset and on the basics of creating feature tables. The second article will cover feature table creation in greater depth, feature discovery, and ensuring maximum re-usability. The final article will discuss feature and function serving and using the feature store with external models.

Feature Stores

Machine learning uses existing data to build a model to predict future outcomes. In almost all cases, the raw data requires preprocessing and transformation before it can be used to build a model. This process is called feature engineering, and the outputs of this process are called features - the building blocks of the model.

Developing features is complex and time-consuming. An additional complication is that for machine learning, feature calculations need to be done for model training and then again when the model is used to make predictions. These implementations may not be performed by the same team or using the same code environment, which can lead to delays and errors. Additionally, different teams in an organization will often have similar feature needs but may not be aware of the work that other teams have done. A feature store is designed to address these problems.

A feature store is a centralized repository that enables data scientists to find and share features and also ensures that the same code used to compute the feature values is used for model training and inference.

You won't want to miss this...

As a companion to this article, we spoke to one of our most valued partners, Gavi Regunath, about this subject. You can watch the recording of that discussion on the Advancing Analytics YouTube channel.


Databricks Feature Store

Databricks' Feature Store has been merged into Unity Catalog (now referred to as “Feature Engineering in Unity Catalog”). Features are stored in feature tables, which are regular Delta tables with a primary key and, optionally, a timestamp key identified. This integration enables a number of capabilities:

  • Unified permissions model: reduces platform complexity with a consistent approach to securing and governing your data and AI assets.
  • Discoverability: the Feature Store UI, accessible from the Databricks workspace, lets you browse and search for existing features.
  • End-to-end lineage: better understand both the upstream dependencies (the data sources) and the downstream dependencies, such as the models, notebooks, jobs, and endpoints that reference the features.
  • Data monitoring: proactively monitor your features and models to mitigate data and model drift.

In addition to these capabilities provided by Unity Catalog, the service provides a number of others:

  • Integrates with model scoring and serving. When you use features from Feature Store to train a model, the model is packaged with feature metadata. When you use the model for batch scoring or online inference, it automatically retrieves features from the Feature Store. The caller does not need to know about them or include logic to look up or join features to score new data. This makes model deployment and updates much easier.
  • Point-in-time lookups. Feature Store supports time series and event-based use cases that require point-in-time correctness ensuring that no future data is used for training.
  • Synchronizes with online row-oriented data stores to enable online inference.

Delta Tables vs Feature Tables

Integrating the Feature Store with Unity Catalog unlocks a lot of capabilities and simplifies its use.  It does raise a question though: is there any distinction between a Feature Table and a regular Delta table (assuming a primary key has been defined)?

From a technical perspective, there is no distinction.  Any table (with a PK defined) can be used as a feature table.  Whether it will be useful or not depends on what you are trying to do.   Keep in mind that for standard machine learning algorithms all of the features for a given example must be condensed into a single row.  This often means that data needs to be aggregated before it can be used in a machine learning model. 

For example, suppose you are building a customer churn model and you have a fact table of customer interactions with your support organization. Is there relevant information there? Certainly!  Can you use the fact table as is?  I.e., in the context of a customer churn model could it be a feature table?  No. And the reason is simple: there is a one-to-many relationship between the customer and the interactions table. In a training set, each example must be a single row, so you would need to aggregate your interactions by customer before you can use it as a feature table to predict if a specific customer will churn. 

What if the entity you are making predictions about is an ‘event’  or a ‘fact’ - for example whether a credit card or bank transfer transaction is fraudulent?  In this case it does make sense to think of the fact table as a feature table.  Any relevant dimension could also be regarded as a feature table.  However, if there was an important signal in prior transaction history (for which there is a one-to-many relationship with the fact you are making a prediction about) you’d need to aggregate this before using it in a machine learning model.

Using the Feature Store

Let’s work through a simple example of using the feature store to create a training data set for a customer churn model.  For now, we’ll assume that all of the features we want are already available in the feature store.

Defining a lookup DataFrame

We need to build a dataframe that includes the lookup keys and the ground truth label. The lookup keys consist of the primary key of the entity you are making predictions about and any foreign keys needed to retrieve relevant features. A timestamp is also required if any features of interest include a timestamp key (which is very often the case). There is a good example of this in the documentation.

How you construct this dataframe is critical to your machine learning effort. It determines which entities you include in your training data set, the point in time at which to retrieve features (where applicable), and the label. We find it convenient to refer to this as the EOL dataframe (Entity, Observation time, Label).

Let’s suppose we track customer renewals in a table called customer_subscriptions. Here’s an example of how we might construct our EOL dataframe. Notice that we are setting the observation date to 3 months prior to the renewal date; this is so that our model will learn to predict when a customer is likely to churn while there is still time for us to take action. Notice also that we are only looking at customers with a 3-year contract.

spark.sql("""
    CREATE OR REPLACE VIEW renewal_eol AS
    SELECT    customer_id
            , to_date(dateadd(MONTH, -3, renewal_date)) AS observation_date
            , churn
    FROM    customer_subscriptions
    WHERE   contract_length = 3
""")

eol_df = spark.sql("SELECT * FROM renewal_eol")


Defining Features to use in our model

Let’s assume that we have pre-existing feature tables with information about consumption growth,  interactions with customer service and the customer.  We use the Databricks FeatureLookup API to define the features to include in our dataset:

from databricks.feature_engineering import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="default.consumption_growth",
        feature_names=["6_month_growth_p1", "6_month_growth_p2"],
        lookup_key="customer_id",
        timestamp_lookup_key="observation_date"
    ),
    FeatureLookup(
        table_name="default.customer_service_calls",
        feature_names=["customer_service_count"],
        lookup_key="customer_id",
        timestamp_lookup_key="observation_date"
    ),
    FeatureLookup(
        table_name="default.customers",
        feature_names=["tier"],
        lookup_key="customer_id",
        timestamp_lookup_key="observation_date"
    )
]


Generating a training dataset

With our  EOL dataframe and feature lookup definition we are ready to create a training dataset using the FeatureEngineeringClient.  

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()
training_set = fe.create_training_set(df=eol_df, feature_lookups=feature_lookups, label="churn")
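
To get the joined data back as a Spark DataFrame for model training, you can call load_df() on the returned TrainingSet:

training_df = training_set.load_df()
display(training_df)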

Notice that we used features from multiple tables to construct a training dataset. This will often be the case, especially when you are using features that another team has already created for you; there simply will not be a single ready-made feature table with everything you need. Organizing feature tables by domain, source tables, update frequency, or other criteria is important for getting the most out of a feature store. We’ll discuss this in more detail in the next blog article in this series.

See also the Databricks documentation and more detailed example notebooks.

Mixing feature tables and non-feature tables for training

In the above example, we assumed that all of our features were already available in the feature store. In many cases you’ll want to use features that have not yet been stored in feature tables. To include these, simply add them to the EOL dataframe. Features that will only be available at inference time (for example, the amount of a credit card transaction) must be included in the EOL dataframe. See the documentation for more details.
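
As a rough sketch, adding such features is just a DataFrame operation; here, new_features_df is a hypothetical DataFrame of feature values that have not (yet) been published to a feature table, keyed by the same lookup columns:

# new_features_df is a hypothetical DataFrame of ad hoc or inference-time features,
# keyed by the same columns used for the feature lookups.
eol_df = eol_df.join(
    new_features_df,
    on=["customer_id", "observation_date"],
    how="left"
)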

Saving a model

Once the model is trained using an ML algorithm flavor known to MLflow, log it with log_model(), passing the model object and the TrainingSet object created earlier by create_training_set; this captures the feature lookup information alongside the model. See this section or the notebook examples for more details.
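
As an illustrative sketch, assuming a scikit-learn model object named model and the training_set created above (the artifact path, catalog, schema, and model name are hypothetical):

import mlflow

fe.log_model(
    model=model,                          # trained model object (assumed)
    artifact_path="churn_model",
    flavor=mlflow.sklearn,                # MLflow flavor matching your algorithm
    training_set=training_set,            # captures the feature lookup metadata
    registered_model_name="main.default.churn_model"  # optional; assumes a main.default catalog/schema
)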

Using a model for inference

For batch inference, use score_batch(), passing the model URI and the DataFrame to be scored; see this section or the notebook examples for more details, including how to get the model URI from the model name.
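
For example, a minimal sketch assuming the registered model name used above and a DataFrame batch_df that contains the lookup keys (customer_id, observation_date) plus any inference-time features:

from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

# All other features are looked up automatically from the feature tables.
predictions = fe.score_batch(
    model_uri="models:/main.default.churn_model/1",  # assumed model name and version
    df=batch_df
)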

Train a model with AutoML and Feature Tables

You can also make use of feature tables with Databricks AutoML: point AutoML at the same EOL table defined above and specify the feature tables to use, either in the AutoML UI or via the API by setting the feature_store_lookups parameter.
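
Using the API, this might look like the following sketch (the feature_store_lookups entries mirror the FeatureLookup definitions above; only one table is shown for brevity):

from databricks import automl

summary = automl.classify(
    dataset=eol_df,
    target_col="churn",
    timeout_minutes=30,
    feature_store_lookups=[
        {
            "table_name": "default.consumption_growth",
            "lookup_key": "customer_id",
            "timestamp_lookup_key": "observation_date"
        }
    ]
)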

Notes on ‘point-in-time’ lookup

In many cases the value of a feature changes over time. In such cases, you need a timestamp key in your feature table. When you retrieve the feature using the API, you must include the observation date as described above. The Feature Store API will perform an ‘AS OF’ join to ensure that the most recent feature value as of the observation timestamp is used in the training set. See the documentation for more details.

Creating Feature Tables

Here are a few examples of creating feature tables. In practice, this simply means identifying the primary key and, optionally, the timeseries key using SQL DDL or the Python feature_engineering API. Any table can be used, but often these will be dimension, fact, or aggregation tables, so we use standard dimensional modeling terminology here. Note that we are not suggesting that an existing ‘star schema’ model will suffice for all of your ML needs. You will almost certainly need to create new tables (or extend existing ones) to capture important features for ML.

Type 1 Dimension Table

A type 1 dimension table (for example a customer table) can be used as a feature table as long as the PK has been defined.

ALTER TABLE customer ADD CONSTRAINT pk_customers PRIMARY KEY(customer_id)
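
Note that primary key columns must be non-nullable; if the column is currently nullable, set a NOT NULL constraint first:

ALTER TABLE customer ALTER COLUMN customer_id SET NOT NULL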

Type 2 Dimension Table

A type 2 dimension table (for example a customer table) will have two columns to indicate the time range for which a particular row is active.  These columns are generally a variant of ‘start date’ and ‘end date’.  To use this kind of table as a feature table, define a primary key as a combination of the entity id (for example a customer id) and the column corresponding to the ‘start date’.  For example:

ALTER TABLE customer ADD CONSTRAINT pk_customers PRIMARY KEY(customer_id, row_effective_from TIMESERIES)

The point-in-time lookup capability in the Feature Store API will ensure that the correct row is used.

Fact Table

If your model makes predictions about an entity represented in a fact table (e.g., is a particular transaction fraudulent?), then just make sure the table includes a primary key constraint:

ALTER TABLE transaction ADD CONSTRAINT pk_transaction_id PRIMARY KEY(transaction_id)

If your model makes predictions about some other entity but there is important signal in the fact table, you would typically aggregate the facts and store the results in a new table.  For example if your model is trying to detect credit fraud you might aggregate facts related to account transfers over some time window.  This example is dealt with below.  

The case where you don’t want to aggregate the facts in some way is rare. In practice this means you want to retrieve the fact that is closest to, but prior to, the observation date. To do this, simply include a timestamp in the primary key constraint on the fact table.

Note that the fact table needs to include the entity pk as a foreign key in order to be able to perform the lookup.
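
As a hypothetical sketch, if predictions are keyed by customer and you want the most recent transaction prior to the observation date, the fact table could be keyed by the entity id and an assumed transaction_ts timestamp column (a table has only one primary key, so this is an alternative to keying by transaction_id):

ALTER TABLE transaction ADD CONSTRAINT pk_transaction_by_customer PRIMARY KEY(customer_id, transaction_ts TIMESERIES)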

Aggregation Table

As discussed above, a model will often include information from a related fact table. Typically these facts are aggregated based on the associated entity (e.g., customer) and a time window (though a simple aggregation based on the entity alone is possible).

Except in the case of a simple aggregation based on the related entity, you must include a timeseries column in the primary key constraint. In a later article we will discuss the time density of aggregations (e.g., are 30-day running aggregations computed every day? Every week?).

ALTER TABLE aggregation_table ADD CONSTRAINT pk_entity PRIMARY KEY(entity_id, timestamp TIMESERIES)
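
As a rough sketch of how such a table might be produced, assuming a hypothetical support_interactions fact table, a 30-day rolling count computed daily could be built like this before declaring the primary key:

CREATE OR REPLACE TABLE customer_service_calls AS
SELECT
    customer_id,
    window.end AS observation_ts,          -- end of each 30-day window
    COUNT(*)   AS customer_service_count
FROM support_interactions                  -- hypothetical fact table
GROUP BY customer_id, window(interaction_ts, '30 days', '1 day');

ALTER TABLE customer_service_calls ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE customer_service_calls ALTER COLUMN observation_ts SET NOT NULL;
ALTER TABLE customer_service_calls ADD CONSTRAINT pk_customer_service_calls PRIMARY KEY(customer_id, observation_ts TIMESERIES);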


Preprocessing Features

The last step in feature engineering often involves transforming features to make them usable by particular machine learning algorithms. This is often referred to as pre-processing: transformations like scaling, normalization, imputation, and one-hot encoding. Pre-processing should be performed as part of the model itself (for example, in a sklearn pipeline), not performed ahead of time and included in the Feature Store. This allows data scientists to choose the kind of pre-processing appropriate for the algorithm they are using.
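
A minimal sketch of this pattern with scikit-learn, using the feature names from the churn example above:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["6_month_growth_p1", "6_month_growth_p2", "customer_service_count"]
categorical_features = ["tier"]

# Pre-processing lives inside the model pipeline, not in the feature tables.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])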

Putting it all together

In many machine learning projects you will use existing features as well as create new features for your model.   This section highlights some of the key steps.

  1. Start with your EOL definition.  Define what entities you want to make predictions about.  You may want to segment by geography or other characteristics.  Define your ‘observation date’ and the associated label.
  2. Define feature lookups for existing features.  If there are relevant features already available in the feature store, define the feature lookup structure.
  3. Add new features to the EOL dataframe.  Any new features that you want to use should be calculated and added to the EOL dataframe.
  4. Use the feature API to generate a training dataset.
  5. Build and evaluate your model and your features.  How well does the model perform?  Which features are significant?
  6. Repeat the process until you are satisfied with the model and your features.  You will likely experiment with the model algorithm and hyper-parameters, but you will also want to experiment with features and observation dates.

Once you have a model you are happy with, new features should be saved in one or more feature tables along with appropriate pipelines for keeping them updated.   This applies to all features except those that will only be available at inference time.  These features need to be provided by the requesting system.  When training a model using the feature store these features remain in the EOL dataframe.
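
As a rough sketch of what publishing a new feature table might look like with the Python feature_engineering API (new_features_df and updated_features_df are hypothetical DataFrames of computed feature values):

# Create the feature table from a DataFrame of computed features.
fe.create_table(
    name="default.consumption_growth",
    primary_keys=["customer_id", "observation_date"],
    timeseries_columns="observation_date",
    df=new_features_df,
    description="6-month consumption growth features per customer"
)

# A scheduled pipeline can keep the features fresh by upserting new values.
fe.write_table(
    name="default.consumption_growth",
    df=updated_features_df,
    mode="merge"
)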

Once you have created your new feature tables, you should update the feature lookups to include the new features. Then rebuild your model from scratch using the EOL dataframe.

When creating a new feature table you want to consider the following:

  1. What entities should be included?  Your model may target only a subset of the entities it is making predictions about (e.g., it may only target customers in a geographic region).  In order to maximize the reusability of your feature table across different data science teams, you may want to include all customers.
  2. How much history to include?
  3. How frequently should time-dependent features be re-calculated?

For now, you might just include what is needed for your model.  In future articles, we’ll discuss strategies to ensure that your features can be used by as many teams as possible.

Coming up next!

Next blog in this series: MLOps Gym - Harnessing the power of Spark in data science/machine learning workflows