Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

How is model drift calculated when the baseline table has no timestamp column?

MohsenJ
New Contributor III

I am trying to understand how Databricks computes model drift when a baseline table is available. What I understood from the documentation is that Databricks processes both the primary and the baseline tables according to the granularities specified in the monitor, stores the result in the profile metrics table, and then uses a specific measure such as the KS test to compare the distributions of the values of both tables in a given window.

What I can't figure out is how this works if my baseline table has no timestamp. This is the only information I found in the documentation, and it is very vague:

.... The exception is the timestamp column for tables used with time series or inference profiles. If columns are missing in either the primary table or the baseline table, monitoring uses best-effort heuristics to compute the output metrics


For example, when I use the model serving endpoint, the timestamp column of my primary table corresponds to the time when a client calls the endpoint to compute the prediction for some query. Now, imagine I want to use my validation dataset as the baseline table. How does Databricks match the rows of the two tables?


1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager

Hi @MohsenJ, let's look at how Databricks handles model drift calculation when the baseline table lacks a timestamp column.

  1. Baseline Table without Timestamp:

    • When your baseline table doesn’t have a timestamp column, Databricks employs best-effort heuristics to compute the output metrics for model drift.
    • Essentially, it tries to make reasonable estimates even in the absence of explicit timestamps.
  2. Model Drift Calculation Process:

    • Databricks processes both the primary (current) and baseline (historical) tables based on the specified granularities defined in the monitoring configuration.
    • The results are stored in the profile metric table.
    • To compare the distribution between the values of both tables, Databricks typically uses a specific measure such as the Kolmogorov-Smirnov (KS) test.
  3. Handling Timestamps:

    • In scenarios where you have a timestamp column (e.g., when using time series or inference profiles), Databricks leverages this information for drift calculations.
    • However, if the timestamp column is missing in either the primary or baseline table, it resorts to the aforementioned heuristics.
  4. Example with Model Serving Endpoint:

    • Suppose your primary table’s timestamp corresponds to when a client calls the model serving endpoint to compute predictions.
    • If you want to use your validation dataset (which lacks timestamps) as the baseline, Databricks will still perform drift analysis.
    • It will align rows based on other available features or columns, even if explicit timestamps are absent.
  5. Matching Rows:

    • Databricks attempts to match rows between the primary and baseline tables using common features or heuristics.
    • While it won’t be as precise as timestamp-based matching, it aims to provide meaningful insights into model drift.
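To make the comparison step more concrete, here is a minimal sketch in plain Python. It is not Databricks's implementation; it only illustrates the general idea of grouping timestamped primary rows into windows and comparing each window with a baseline sample using the two-sample KS statistic. Note that a distribution test such as KS compares two samples as a whole, so by itself it does not need the baseline rows to carry timestamps or to be paired with primary rows. The function names (`ks_statistic`, `drift_per_window`), the daily granularity, and the data are all hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The largest gap between two step-function CDFs occurs at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drift_per_window(primary_rows, baseline_values):
    """Group (timestamp, value) rows by calendar day (an assumed granularity),
    then compare each day's values with the entire untimestamped baseline."""
    windows = defaultdict(list)
    for ts, value in primary_rows:
        windows[ts.date()].append(value)
    return {day: ks_statistic(vals, baseline_values)
            for day, vals in sorted(windows.items())}


# Hypothetical data: a validation set without timestamps as the baseline,
# and endpoint request logs with timestamps as the primary table.
baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
primary = [
    (datetime(2024, 1, 1, 9), 0.1),
    (datetime(2024, 1, 1, 15), 0.3),
    (datetime(2024, 1, 1, 18), 0.5),
    (datetime(2024, 1, 2, 10), 0.9),
    (datetime(2024, 1, 2, 11), 1.0),
]
drift = drift_per_window(primary, baseline)
for day, stat in drift.items():
    print(day, round(stat, 3))
```

In this toy data the first day's values fall inside the baseline range and give a small statistic, while the second day's values lie entirely above the baseline and give the maximum statistic of 1.0. Only the primary rows need a timestamp here, because the baseline is used as a single reference sample.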

In summary, Databricks adapts its approach based on the available information. When timestamps are missing, it relies on clever estimation techniques to compute drift metrics. Keep in mind that this process may not be as accurate as when explicit timestamps are present, but i...12.

 

