This is the first of three articles about using the Databricks Feature Store. This first article focuses on using existing features to create your dataset and on the basics of creating feature tables. The second article will cover feature table creation in greater depth, feature discovery and ensuring maximum reusability. The final article will discuss feature and function serving and using the feature store with external models.
Machine learning uses existing data to build a model to predict future outcomes. In almost all cases, the raw data requires preprocessing and transformation before it can be used to build a model. This process is called feature engineering, and the outputs of this process are called features - the building blocks of the model.
Developing features is complex and time-consuming. An additional complication is that for machine learning, feature calculations need to be done for model training and then again when the model is used to make predictions. These implementations may not be performed by the same team or using the same code environment, which can lead to delays and errors. Additionally, different teams in an organization will often have similar feature needs but may not be aware of the work that other teams have done. A feature store is designed to address these problems.
A feature store is a centralized repository that enables data scientists to find and share features and also ensures that the same code used to compute the feature values is used for model training and inference.
As a companion to this article, we spoke to one of our most valued partners, Gavi Regunath, about this subject. You can watch the recording of that discussion on the Advancing Analytics YouTube channel.
Databricks' Feature Store has been merged into Unity Catalog (now referred to as “Feature Engineering in Unity Catalog”). Features are stored in Feature tables, which are regular Delta tables with a primary key and, optionally, a timestamp key identified. This integration means feature tables automatically benefit from standard Unity Catalog capabilities such as access control, lineage and discovery across workspaces.
In addition to these capabilities provided by Unity Catalog, the feature engineering service provides capabilities of its own, such as point-in-time (AS OF) lookups against time series feature tables, packaging of feature lookup logic with logged models, and batch scoring and serving that reuse that logic at inference time.
Integrating the Feature Store with Unity Catalog unlocks a lot of capabilities and simplifies its use. It does raise a question though: is there any distinction between a Feature Table and a regular Delta table (assuming a primary key has been defined)?
From a technical perspective, there is no distinction. Any table (with a PK defined) can be used as a feature table. Whether it will be useful or not depends on what you are trying to do. Keep in mind that for standard machine learning algorithms all of the features for a given example must be condensed into a single row. This often means that data needs to be aggregated before it can be used in a machine learning model.
For example, suppose you are building a customer churn model and you have a fact table of customer interactions with your support organization. Is there relevant information there? Certainly! Can you use the fact table as is - that is, could it serve as a feature table in the context of a customer churn model? No, and the reason is simple: there is a one-to-many relationship between the customer and the interactions table. In a training set, each example must be a single row, so you would need to aggregate your interactions by customer before you can use them as a feature table to predict whether a specific customer will churn.
What if the entity you are making predictions about is an ‘event’ or a ‘fact’ - for example whether a credit card or bank transfer transaction is fraudulent? In this case it does make sense to think of the fact table as a feature table. Any relevant dimension could also be regarded as a feature table. However, if there was an important signal in prior transaction history (for which there is a one-to-many relationship with the fact you are making a prediction about) you’d need to aggregate this before using it in a machine learning model.
Let’s work through a simple example of using the feature store to create a training data set for a customer churn model. For now, we’ll assume that all of the features we want are already available in the feature store.
We need to build a dataframe that includes the lookup keys and the ground truth label. The lookup keys consist of the primary key of the entity you are making predictions about and any foreign keys needed to retrieve relevant features. A timestamp is also required if any features of interest include a timestamp key (which is very often the case). There is a good example of this in the documentation.
How you construct this dataframe is critical to your machine learning effort. It determines what entities you include in your training data set, the point in time to retrieve a set of features (where applicable) and the label. We find it convenient to refer to this as the EOL dataframe (Entity, Observation time, Label).
Let's suppose we track customer renewals in a table called customer_subscriptions. Here's an example of how we might construct our EOL dataframe. Notice that we are setting the observation date to 3 months prior to the renewal date - this is so that our model will learn to predict when a customer is likely to churn while there is still time for us to take action. Notice also that we are only looking at customers with a 3-year contract.
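The sketch below assumes the customer_subscriptions table contains customer_id, renewal_date, contract_term_years and renewed columns (illustrative names):

```python
from pyspark.sql import functions as F

# Build the EOL (Entity, Observation time, Label) dataframe.
# Column names are assumptions about the customer_subscriptions schema.
eol_df = (
    spark.table("customer_subscriptions")
    .where(F.col("contract_term_years") == 3)                        # only 3-year contracts
    .select(
        F.col("customer_id"),                                        # entity / lookup key
        F.add_months("renewal_date", -3).alias("observation_date"),  # 3 months before renewal
        (~F.col("renewed")).cast("int").alias("churned"),            # ground-truth label
    )
)
```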
Let’s assume that we have pre-existing feature tables with information about consumption growth, interactions with customer service and the customer. We use the Databricks FeatureLookup API to define the features to include in our dataset:
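The table names, feature names and keys below are illustrative; substitute your own catalog and schema:

```python
from databricks.feature_engineering import FeatureLookup

feature_lookups = [
    FeatureLookup(
        table_name="ml.features.consumption_growth",
        feature_names=["consumption_growth_30d", "consumption_growth_90d"],
        lookup_key="customer_id",
        timestamp_lookup_key="observation_date",   # point-in-time (AS OF) lookup
    ),
    FeatureLookup(
        table_name="ml.features.customer_support_agg",
        feature_names=["tickets_opened", "avg_resolution_hours"],
        lookup_key="customer_id",
        timestamp_lookup_key="observation_date",
    ),
    FeatureLookup(
        table_name="ml.features.customer_dim",
        feature_names=["industry", "region", "contract_value"],
        lookup_key="customer_id",
    ),
]
```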
With our EOL dataframe and feature lookup definition we are ready to create a training dataset using the FeatureEngineeringClient.
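A sketch of that step, reusing the eol_df and feature_lookups objects defined above:

```python
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

training_set = fe.create_training_set(
    df=eol_df,                         # the EOL dataframe built earlier
    feature_lookups=feature_lookups,   # the lookups defined earlier
    label="churned",
    exclude_columns=["customer_id"],   # keys are needed for the join but not as model inputs
)

training_df = training_set.load_df()   # materialize the training dataframe
display(training_df)
```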
Notice that we used features from multiple tables to construct a training dataset. This will often be the case, especially when you are using features that another team has already created for you; there simply will not be a single ready-made feature table with everything you need. Organizing feature tables by domain, source tables, update frequency or other criteria is important for getting the most out of a feature store. We'll discuss this in more detail in the next blog article in this series.
See also the Databricks documentation and more detailed example notebooks.
In the above example, we assumed that all of our features were already available in the feature store. In many cases you'll want to use features that have not yet been stored in feature tables. To include these, simply add them to the EOL dataframe. Features that will only be available at inference time (for example the amount of a credit card transaction) must be included in the EOL dataframe. See the Databricks documentation for details.
Once the model is trained using an ML algorithm flavor known to MLflow, we need to capture the feature lookup information by logging the model with log_model(), passing both the model object and the TrainingSet object created earlier by create_training_set. See the documentation or the example notebooks for more details.
For batch inference, use score_batch(), passing the model URI and the dataframe to be scored; see the documentation or the example notebooks for more details, including how to get the model URI from the model name.
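A minimal sketch of both steps, assuming a scikit-learn model and reusing the fe, training_set, training_df and eol_df objects from the snippets above (model and registry names are illustrative):

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier

# Train a simple model on the materialized training dataframe
# (categorical handling omitted for brevity).
pdf = training_df.toPandas()
X = pdf.select_dtypes("number").drop(columns=["churned"])
y = pdf["churned"]
model = RandomForestClassifier().fit(X, y)

# Log the model with the TrainingSet so the feature lookup metadata is packaged with it.
fe.log_model(
    model=model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="churn_model",    # illustrative name
)

# Batch inference: score_batch looks up the features for each row using the packaged lookups.
scored_df = fe.score_batch(
    model_uri="models:/churn_model/1",      # illustrative URI; resolve it from the registry in practice
    df=eol_df.drop("churned"),
)
```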
You can also make use of feature tables when using Databricks AutoML: this means you can point AutoML at the same EOL dataframe as defined above. In this case, you need to specify the feature tables to use, either in the AutoML UI or via the API by setting the feature_store_lookups parameter.
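A hedged sketch of the API route; the dictionary form of feature_store_lookups shown here mirrors the FeatureLookup arguments used earlier, and the table name is illustrative:

```python
from databricks import automl

summary = automl.classify(
    dataset=eol_df,                 # the same EOL dataframe as above
    target_col="churned",
    timeout_minutes=30,
    feature_store_lookups=[
        {
            "table_name": "ml.features.consumption_growth",
            "lookup_key": "customer_id",
            "timestamp_lookup_key": "observation_date",
        },
    ],
)
```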
In many cases the value of a feature changes over time. In such cases, you need a timestamp key in your feature table. When you retrieve the feature using the API you must include the observation date as described above. The Feature Store API will perform an ‘AS OF’ join to ensure that the latest feature value as of the observation timestamp is used in the training set. See the documentation for more details.
Here are a few examples of creating feature tables. In practice, this simply means identifying the primary key and, optionally, the timeseries key using SQL DDL or the Python feature_engineering API. Any table can be used, but often these will be dimension, fact or aggregation tables, so we are using standard dimensional modeling terminology here. Note that we are not suggesting that an existing ‘star schema’ model will suffice for all of your ML needs. You will almost certainly need to create new tables (or extend existing ones) to capture important features for ML requirements.
A type 1 dimension table (for example a customer table) can be used as a feature table as long as the PK has been defined.
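For example, a sketch using SQL DDL from a notebook (table and column names are illustrative; primary key columns must be declared NOT NULL first):

```python
# Declare a primary key on an existing type 1 customer dimension so it can serve as a feature table.
spark.sql("ALTER TABLE ml.features.customer_dim ALTER COLUMN customer_id SET NOT NULL")
spark.sql("""
    ALTER TABLE ml.features.customer_dim
    ADD CONSTRAINT customer_dim_pk PRIMARY KEY (customer_id)
""")
```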
A type 2 dimension table (for example a customer table) will have two columns to indicate the time range for which a particular row is active. These columns are generally a variant of ‘start date’ and ‘end date’. To use this kind of table as a feature table, define a primary key as a combination of the entity id (for example a customer id) and the column corresponding to the ‘start date’. For example:
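A sketch assuming the ‘start date’ column is called valid_from; the TIMESERIES keyword marks it as the timestamp key used for point-in-time lookups:

```python
# Type 2 dimension: the primary key combines the entity id with the row's 'start date'.
spark.sql("ALTER TABLE ml.features.customer_dim_scd2 ALTER COLUMN customer_id SET NOT NULL")
spark.sql("ALTER TABLE ml.features.customer_dim_scd2 ALTER COLUMN valid_from SET NOT NULL")
spark.sql("""
    ALTER TABLE ml.features.customer_dim_scd2
    ADD CONSTRAINT customer_dim_scd2_pk PRIMARY KEY (customer_id, valid_from TIMESERIES)
""")
```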
The point-in-time lookup capability in the Feature Store API will ensure that the correct row is used.
If your model makes predictions about an entity represented in a fact table (e.g., is a particular transaction fraudulent?) then just make sure the table includes a primary key constraint:
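For example, a sketch of a transactions fact table used directly as a feature table (table and column names are illustrative):

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS ml.features.card_transactions (
        transaction_id    STRING NOT NULL,
        account_id        STRING NOT NULL,      -- foreign key used to look up account-level features
        transaction_ts    TIMESTAMP NOT NULL,
        amount            DOUBLE,
        merchant_category STRING,
        CONSTRAINT card_transactions_pk PRIMARY KEY (transaction_id)
    )
""")
```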
If your model makes predictions about some other entity but there is important signal in the fact table, you would typically aggregate the facts and store the results in a new table. For example if your model is trying to detect credit fraud you might aggregate facts related to account transfers over some time window. This example is dealt with below.
The case where you don’t want to aggregate the facts in some way is rare. In practice, it means you want to retrieve the fact that is closest to - but prior to - the observation date. To do this, simply include a timestamp in the primary key constraint on the fact table.
Note that the fact table needs to include the entity’s primary key as a foreign key in order to perform the lookup.
As discussed above a model will often include information from a related fact table. Typically these facts are aggregated based on the associated entity (e.g., customer) and a time window (though a simple aggregation based on the entity alone is possible).
Except in the case of a simple aggregation based only on the related entity, you must include a timeseries column in the primary key constraint. In a later article we will discuss the time density of aggregation (e.g., are 30-day running aggregations computed every day? Every week?).
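A sketch of such an aggregation, assuming a support_interactions fact table with customer_id, interaction_ts and resolution_hours columns (illustrative names); it computes daily aggregates per customer and declares a timeseries key (a true rolling 30-day window would use window functions on top of this):

```python
from pyspark.sql import functions as F

# Aggregate the facts by customer and day.
agg_df = (
    spark.table("support_interactions")
    .groupBy("customer_id", F.to_date("interaction_ts").alias("as_of_date"))
    .agg(
        F.count("*").alias("tickets_opened"),
        F.avg("resolution_hours").alias("avg_resolution_hours"),
    )
)

# Write the aggregates, then declare the keys so this becomes a time series feature table.
agg_df.write.mode("overwrite").saveAsTable("ml.features.customer_support_agg")
spark.sql("ALTER TABLE ml.features.customer_support_agg ALTER COLUMN customer_id SET NOT NULL")
spark.sql("ALTER TABLE ml.features.customer_support_agg ALTER COLUMN as_of_date SET NOT NULL")
spark.sql("""
    ALTER TABLE ml.features.customer_support_agg
    ADD CONSTRAINT customer_support_agg_pk PRIMARY KEY (customer_id, as_of_date TIMESERIES)
""")
```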
The last step in feature engineering often involves transforming features to make them usable by particular machine learning algorithms. This is often referred to as pre-processing: transformations like scaling, normalization, imputation and one-hot encoding. Pre-processing should be performed as part of the model itself (for example in a sklearn pipeline) rather than computed ahead of time and stored in the Feature Store. This allows data scientists to choose the kind of pre-processing appropriate for the algorithm they are using.
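A minimal sketch of this pattern with scikit-learn (column names are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["consumption_growth_90d", "tickets_opened"]   # illustrative
categorical_cols = ["industry", "region"]                     # illustrative

# Pre-processing lives inside the model pipeline, so it is logged and applied with the model
# rather than being materialized in the feature store.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("preprocess", preprocess), ("classifier", RandomForestClassifier())])
```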
In many machine learning projects you will use existing features as well as create new features for your model. This section highlights some of the key steps.
Once you have a model you are happy with, new features should be saved in one or more feature tables along with appropriate pipelines for keeping them updated. This applies to all features except those that will only be available at inference time. These features need to be provided by the requesting system. When training a model using the feature store these features remain in the EOL dataframe.
Once you have created your new feature tables you should update the feature lookups to include the new features. Then re-build your model from scratch using an EOL dataframe.
When creating a new feature table, you will want to consider how it is organized and named, its keys and update frequency, and who else in your organization might be able to use it.
For now, you might just include what is needed for your model. In future articles, we’ll discuss strategies to ensure that your features can be used by as many teams as possible.
Next blog in this series: MLOps Gym - Harnessing the power of Spark in data science/machine learning workflows