This is the second of three articles about using the Databricks Feature Store. The first article focused on using existing features to create your dataset and the basics of creating feature tables. This second article will cover feature table creation in greater depth, feature discovery, and ensuring maximum reusability. The final article will discuss feature and function serving and using the feature store with external models.
If you have spent any time working with datasets from Kaggle, you may have developed the habit of thinking about data as a single CSV file or table. The reality is that by the time you have distilled your data into a single table, most of the work of machine learning has already been done! In the ‘real world’ your data lives in a Lakehouse, Warehouse, or a collection of operational RDBMSs. For this reason, it makes sense to think about your data using standard data warehousing terms, namely dimension, fact, and aggregate tables. The entities you want to make predictions about live in either dimension tables (e.g., customer) or fact tables (e.g., a transaction).
Your job as an ML practitioner is to use dimension, fact, and aggregate tables to create a dataset. Let’s look at each of these in turn.
If your model requires bringing in new data in the form of dimension or fact tables, just follow standard practices for ingesting and processing data on Databricks.
Assuming the data is available, your dimension tables should always be SCD Type 2 (slowly changing dimension, Type 2). Create a primary key constraint on the main entity along with the ‘valid_from’ timestamp discussed in the previous article.
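As a rough sketch of what that can look like (the catalog, schema, table, and column names below are hypothetical, and the DDL assumes a Unity Catalog schema and a Databricks notebook where `spark` is predefined), the primary key combines the entity id with the ‘valid_from’ timestamp, and the `TIMESERIES` keyword marks the time column used for point-in-time lookups:

```python
# A rough sketch, not a definitive schema: an SCD Type 2 customer dimension
# whose primary key combines the entity id with 'valid_from'.
# Catalog, schema, and column names are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.features.dim_customer (
        customer_id  STRING    NOT NULL,
        valid_from   TIMESTAMP NOT NULL,
        valid_to     TIMESTAMP,
        home_state   STRING,
        credit_limit DOUBLE,
        CONSTRAINT dim_customer_pk
            PRIMARY KEY (customer_id, valid_from TIMESERIES)
    )
""")
```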
Your fact table will require a foreign key (FK) reference to the entity in your EOL (Entity, Observation Date, Label) table. As discussed in the previous article, you will also want a primary key (PK) constraint on the fact identifier and timestamp column.
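Continuing the same hypothetical sketch, a transactions fact table might declare a PK on the transaction identifier plus timestamp and an FK back to the entity. Here the FK points at an assumed `main.features.customer` entity table keyed on `customer_id` alone, which is assumed to already exist:

```python
# Rough sketch of a fact table; all names are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.features.fact_transaction (
        transaction_id STRING    NOT NULL,
        transaction_ts TIMESTAMP NOT NULL,
        customer_id    STRING    NOT NULL,
        amount         DOUBLE,
        merchant_zip   STRING,
        CONSTRAINT fact_transaction_pk
            PRIMARY KEY (transaction_id, transaction_ts TIMESERIES),
        CONSTRAINT fact_transaction_customer_fk
            FOREIGN KEY (customer_id) REFERENCES main.features.customer (customer_id)
    )
""")
```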
In some cases a row in a dimension or fact table requires some processing in order to extract a usable feature. This could involve converting two dates into an interval, converting a zip code into a latitude and longitude, or any other transformation that does not involve aggregating rows. This can be handled in three ways:
Aggregations of fact tables are a very common type of feature used in machine learning projects: whenever there is a one-to-many relationship between the entity you want to make predictions about and some other table, you must use aggregations.
Let's think about an aggregation table that might be used for a credit fraud model. We will consider two features: a 14-day running total and a 7-day transaction count.
This is what our table might look like:
| primary_account_number | timestamp | 14-day-total | 7-day-count |
|---|---|---|---|
| abc | Jan 1 2022 | 450.10 | 18 |
| abc | Jan 2 2022 | 376.11 | 16 |
| … | … | … | … |
| xyz | May 7 2024 | 1299.87 | 45 |
| xyz | May 8 2024 | 1400.65 | 41 |
| xyz | May 9 2024 | 1388.23 | 40 |
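There are many ways to build such a table. One rough sketch, assuming a hypothetical `raw.card_transactions` source with `primary_account_number`, `transaction_ts`, and `amount` columns, computes both features at a daily grain with PySpark window functions:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical source of raw card transactions.
txn = spark.table("raw.card_transactions")

# Roll transactions up to one row per account per day.
daily = (
    txn.groupBy(
        "primary_account_number",
        F.to_date("transaction_ts").alias("txn_date"),
    )
    .agg(
        F.sum("amount").alias("daily_total"),
        F.count("*").alias("daily_count"),
    )
    # Days since epoch, so the windows below can use a value-based range frame.
    .withColumn("day_nr", F.datediff("txn_date", F.to_date(F.lit("1970-01-01"))))
)

# Current day plus the preceding 13 (or 6) days.
w14 = Window.partitionBy("primary_account_number").orderBy("day_nr").rangeBetween(-13, 0)
w7 = Window.partitionBy("primary_account_number").orderBy("day_nr").rangeBetween(-6, 0)

features = daily.select(
    "primary_account_number",
    F.col("txn_date").alias("timestamp"),
    # These correspond to the 14-day-total and 7-day-count columns above.
    F.round(F.sum("daily_total").over(w14), 2).alias("total_14d"),
    F.sum("daily_count").over(w7).alias("count_7d"),
)
```

Note that this sketch produces rows only for days on which an account actually transacted; densifying the table is touched on in the discussion of history depth below.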
The first decisions you need to make when you create an aggregation table concern which entities to include (its scope), how far back in time to compute features (its history depth), and how often to recompute them (its grain).
Even if you are only including a subset of the primary account holders in the training dataset for your current model, you should consider including a broader set in your feature table in order to promote sharing and reuse. This decision is very domain-specific, so consider this to be general guidance: the entities in your feature tables should be broader than your current needs.
There is a practical limit to the depth of the history: it cannot be deeper than the history of the underlying entity. In this example, there is nothing to aggregate prior to the start of a cardholder's customer lifetime. So you could choose to include aggregations for each customer starting from the moment they became a cardholder right up to the current time. Alternatively, you could choose to include aggregations for all customers over a common interval, with the feature values set to 0 for periods outside a customer's history. Conceptually there is no difference between these two approaches: dataset creation for your model should be driven by an EOL table, which should not reference time periods prior to an entity's history. The only difference is how errors are handled (i.e., if your EOL table contains a reference to a period prior to the commencement of the entity's history). In the first case, you will get an error. In the second case, you will get a ‘silent failure’ where the feature store will happily return zero-valued features.
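If you take the second approach, one hedged sketch (reusing the hypothetical `daily` dataframe from the rolling-aggregation example above, with an illustrative start date) is to cross join a date spine over the common interval with the set of accounts, left-join the daily aggregates, and zero-fill the gaps before applying the rolling windows:

```python
from pyspark.sql import functions as F

# Common interval shared by all accounts (dates are illustrative).
spine = spark.sql("""
    SELECT explode(sequence(DATE'2022-01-01', current_date(), INTERVAL 1 DAY)) AS txn_date
""")

accounts = daily.select("primary_account_number").distinct()

# One row per account per day; zero for days with no transactions,
# including days before the account's history begins -- this is the
# source of the 'silent failure' described above.
dense_daily = (
    accounts.crossJoin(spine)
    .join(daily, ["primary_account_number", "txn_date"], "left")
    .fillna({"daily_total": 0.0, "daily_count": 0})
    .withColumn("day_nr", F.datediff("txn_date", F.to_date(F.lit("1970-01-01"))))
)
# The w14 / w7 windows from the earlier sketch are then applied to
# dense_daily instead of daily.
```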
You may also elect to shorten the history, to the last 24 months for example, on the grounds that patterns are sufficiently volatile that it would never make sense to train a model on older data.
As with the question of which entities to include, the general guidance is that your history should be deeper than your current needs. But again there are a lot of domain-specific considerations here.
In the example above, features are recomputed every day. This seems like a reasonable grain for 7- and 14-day rolling aggregations, though depending on the volatility of the data you might want to recompute them more frequently. If your features aggregated data over longer time periods (months or years) you might only need to recompute them on a weekly or monthly basis. Conversely, if your features aggregate data over shorter time periods (hours) you would need to recompute them more frequently. Note that the ‘grain’ of the feature (days in the examples above) is not the same as the grain of the feature table. The grain of the feature table can be finer than the grain of the features: you might recompute a 14-day rolling aggregate every 6 hours, though it is probably more common that they correspond.
Note that this implies that your feature table will often have more data than required for your particular training dataset: you only require a version of each feature for each entry in your EOL table, but maintaining a denser feature table ensures that the feature can be more readily used by other teams who will likely have different EOL tables.
Once you have defined your features, you need to write the code to actually create them. As we’ve seen, ML features are just regular columns in a table. There are as many ways of creating feature tables in Databricks as there are ways of creating ‘regular’ tables: remember, a feature table is just a Delta table with a primary key constraint. Options include SQL, PySpark, dbt, or DLT.
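For instance, one minimal sketch (continuing the hypothetical names used above, with an invented table name) writes the `features` dataframe as a Delta table and then declares the primary key, which requires the key columns to be NOT NULL:

```python
table = "main.features.card_txn_rolling_features"  # hypothetical name

# A feature table is just a Delta table ...
features.write.mode("overwrite").saveAsTable(table)

# ... with a primary key constraint. PK columns must be NOT NULL first.
spark.sql(f"ALTER TABLE {table} ALTER COLUMN primary_account_number SET NOT NULL")
spark.sql(f"ALTER TABLE {table} ALTER COLUMN `timestamp` SET NOT NULL")
spark.sql(f"""
    ALTER TABLE {table}
    ADD CONSTRAINT card_txn_rolling_features_pk
    PRIMARY KEY (primary_account_number, `timestamp` TIMESERIES)
""")
```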
In this article, we jumped right in and started building feature tables for our new features. This overlooks an important part of the process: feature discovery. In other words, how do you decide which features to include in your model training? Do you really know that 14-day rolling aggregates have any predictive value? Maybe you need to look at a longer period, maybe a shorter period, maybe you should only aggregate certain transactions. Until you have determined that a particular feature is predictive, it’s probably premature to add it to a feature table. Just calculate the feature for your training dataset and add it to the EOL dataframe.
When you are engaged in feature discovery, you don't need to worry about scope, history, or grain. Just compute the features you need based on your EOL dataframe.
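As a sketch, a candidate 14-day total can be computed directly against a hypothetical EOL dataframe with a single join and aggregation, no feature table required (table and column names below are assumptions):

```python
from pyspark.sql import functions as F

# Hypothetical inputs.
eol = spark.table("main.ml.fraud_eol")        # primary_account_number, observation_date, label
txn = spark.table("raw.card_transactions")    # primary_account_number, transaction_ts, amount

candidate = (
    eol.join(
        txn,
        (txn["primary_account_number"] == eol["primary_account_number"])
        & (txn["transaction_ts"] < eol["observation_date"])
        & (txn["transaction_ts"] >= F.date_sub(eol["observation_date"], 14)),
        "left",
    )
    .groupBy(eol["primary_account_number"], eol["observation_date"], eol["label"])
    .agg(F.coalesce(F.sum(txn["amount"]), F.lit(0.0)).alias("total_14d_candidate"))
)
```

Once a candidate like this proves predictive, it can graduate to a proper feature table using the scope, history, and grain considerations discussed above.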
In this article we’ve discussed how machine learning features relate to regular data warehousing concepts. We’ve also looked at the scope, history depth and grain of a feature. As in the previous article we connect everything together using the concept of an EOL table (Entity, Observation Date, Label).
Happy Feature Engineering!
Next blog in this series: MLOps Gym - The 'promote model' pattern for MLOps.