Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Feature Store with Spark Pipeline

haseeb2001
New Contributor II

Hi,

I am using a Spark pipeline with the stages VectorAssembler, StandardScaler, StringIndexers, VectorAssembler, and GBTClassifier, and I am logging this pipeline with the Feature Store log_model function as follows:

from databricks.feature_store import FeatureStoreClient
fe = FeatureStoreClient()  # I have also tried this with FeatureEngineeringClient

After defining lookups and creating a training_set, I am logging this model using:

 

 

fe.log_model(model=model_pipeline, artifact_path="test_model", flavor=mlflow.spark, training_set=training_set, registered_model_name="registery_name")

 

 

After logging this model, I call fe.score_batch on my test data to get predictions, but I get the following error:

 

[image.png: screenshot of the error message; its text is not reproduced here]

 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @haseeb2001, it seems you’re encountering an issue after logging your Spark pipeline model with the Feature Store client’s fe.log_model function.

Let’s break down the steps and address the error:

  1. Pipeline Stages:

    • Your Spark pipeline has these stages: VectorAssembler, StandardScaler, StringIndexers, a second VectorAssembler, and GBTClassifier.
    • Each of these stages plays a specific role in your machine learning workflow.
  2. Logging the Model:

    • You’re using the fe.log_model function to log your model. This function belongs to the Feature Store client rather than to MLflow itself; it packages the model together with the feature lookup metadata from the training set and logs the result through MLflow.
    • The artifact_path specifies where the model artifacts will be stored within the MLflow run.
    • The flavor parameter is the MLflow module used to save the model (in your case, mlflow.spark).
    • The training_set parameter records which feature lookups the model was trained with, and registered_model_name is the name under which the model is registered.
  3. Troubleshooting:

    • Let’s start by checking the following:
      • Ensure that all stages in your pipeline are correctly set up and compatible with each other.
      • Verify that the training data (training_set) is properly prepared and matches the features used during training.
      • Double-check the registered model name (registered_model_name) to ensure it’s unique and doesn’t conflict with existing models.
      • Review any additional logs or stack traces related to the error for more clues.
  4. Spark UDF Issue:

    • If you’re encountering issues with the fe.score_batch function (which is used for batch inference), consider the following:
      • Ensure that the input data for scoring matches the features used during training.
      • Check if any custom preprocessing or transformations are needed before scoring.
      • Verify that the model artifacts are correctly loaded during inference.
  5. Debugging:

    • If you can provide the specific error message or any additional context, I’d be happy to assist further in debugging the issue.
    • Feel free to share more details, and we’ll work together to resolve it! 😊
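The input check from step 4 can be sketched as a small helper. This is only an illustration under assumptions: the column names below are hypothetical, and in a notebook the second argument would typically be the `columns` list of the DataFrame passed to scoring. With Feature Store scoring, the required columns are usually the lookup keys plus any features that are not served from a feature table.

```python
from typing import Iterable, List, Tuple


def missing_and_extra_columns(
    required: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, extra) column names, each sorted alphabetically."""
    required_set, actual_set = set(required), set(actual)
    missing = sorted(required_set - actual_set)  # needed but not supplied
    extra = sorted(actual_set - required_set)    # supplied but not needed
    return missing, extra


# Hypothetical column names, for illustration only.
required_columns = ["customer_id", "country"]
scoring_columns = ["customer_id", "signup_date"]

missing, extra = missing_and_extra_columns(required_columns, scoring_columns)
print("missing from scoring data:", missing)
print("not expected by the model:", extra)
```

An empty `missing` list does not prove the scoring call will succeed; it only rules out one common cause of a mismatch.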

Hi @Kaniz_Fatma , thanks for your response.

The issue I am facing occurs during fe.score_batch. I also logged this pipeline with MLflow alone and tested it for inference, and that worked fine. The issue appears only when I use Feature Store batch scoring.

I have noticed that score_batch used python_function as the backend flavor, although I registered my model with the spark flavor. Any thoughts on this?
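One thing that may help narrow this down is to check which flavors the logged model actually records. A model saved through mlflow.spark normally lists both a `spark` and a `python_function` flavor in its MLmodel file, so seeing python_function at scoring time does not by itself show that the model was logged incorrectly. The sketch below assumes an MLflow version that provides `mlflow.models.get_model_info`; the example flavor mapping is illustrative and not taken from the model in this thread.

```python
from typing import Dict, List


def summarize_flavors(flavors: Dict[str, dict]) -> List[str]:
    """Return the flavor names recorded for a model, sorted alphabetically."""
    return sorted(flavors)


def flavors_of_logged_model(model_uri: str) -> List[str]:
    """Look up the flavors of a logged model. Requires mlflow to be installed."""
    # Imported lazily so summarize_flavors can be used without mlflow.
    from mlflow.models import get_model_info

    info = get_model_info(model_uri)
    return summarize_flavors(info.flavors)


# Illustrative mapping, shaped like the `flavors` section of an MLmodel file.
example_flavors = {
    "spark": {"model_data": "sparkml"},
    "python_function": {"loader_module": "mlflow.spark"},
}
print(summarize_flavors(example_flavors))
```

In a notebook, `flavors_of_logged_model` would be called with the URI of the registered model version; comparing its result for the model logged with MLflow alone and the one logged through fe.log_model may show where the two differ.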
