Hello,
I am training a logistic regression on text, using a tf-idf vectorizer for the features.
This is done with MLflow and sklearn on Databricks.
The model trains successfully, and I can run predictions from the Jupyter notebook on the Databricks platform.
The MLflow code that creates the model:
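import mlflow
import mlflow.sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

with mlflow.start_run(run_name='logistic_regression') as run:
    # stop_words='english' selects the built-in list; ['english'] would only drop the literal word "english"
    text_transformer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), lowercase=True, max_features=150000)
    lr = LogisticRegression(C=5e1, solver='lbfgs', multi_class='multinomial', random_state=17, n_jobs=4)
    # Fit the vectorizer and log it as its own model
    text_transformer.fit(train_val['text'])
    mlflow.sklearn.log_model(text_transformer, "tfidf-model")
    X_train_text = text_transformer.transform(train_val['text'])
    X_test_text = text_transformer.transform(test['text'])
    # 5-fold stratified cross-validation; the mean F1 is logged as a metric rather than a parameter
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
    cv_results = cross_val_score(lr, X_train_text, train_val['label'], cv=skf, scoring='f1_micro')
    mlflow.log_metric("F1_score", cv_results.mean())
    # Fit the classifier on the full training set and log it
    lr.fit(X_train_text, train_val['label'])
    mlflow.sklearn.log_model(lr, "lr-model")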
In the Models tab, only the logistic regression can be served without problems.
Serving the tf-idf vectorizer fails with the following error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
KeyError: 'python_function'
Inspecting the MLmodel files of the two models under Experiments shows that the file for the tf-idf vectorizer has no 'python_function' entry under flavors.
logistic regression:
artifact_path: lr-model
databricks_runtime: 10.4.x-scala2.12
flavors:
  python_function:
    env: conda.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    python_version: 3.8.10
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.24.1
mlflow_version: 1.28.0
model_uuid: some number
run_id: some number
utc_time_created: 'some date'
tfidf:
artifact_path: tfidf-model
databricks_runtime: 10.4.x-scala2.12
flavors:
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 0.24.1
mlflow_version: 1.28.0
model_uuid: some number
run_id: some number
utc_time_created: 'some date'
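To make the comparison above reproducible, here is a minimal sketch of the same check in Python. It assumes the MLmodel metadata has already been parsed into a dict (for example with a YAML parser; that step is not shown). The helper name missing_flavors is hypothetical and not part of MLflow, and the two dicts are abbreviated transcriptions of the listings above.

```python
# Abbreviated metadata transcribed from the two MLmodel listings in this post.
# Only the "flavors" mapping matters for this check.
lr_meta = {
    "artifact_path": "lr-model",
    "flavors": {
        "python_function": {
            "env": "conda.yaml",
            "loader_module": "mlflow.sklearn",
            "model_path": "model.pkl",
            "python_version": "3.8.10",
        },
        "sklearn": {"pickled_model": "model.pkl", "sklearn_version": "0.24.1"},
    },
}
tfidf_meta = {
    "artifact_path": "tfidf-model",
    "flavors": {
        "sklearn": {"pickled_model": "model.pkl", "sklearn_version": "0.24.1"},
    },
}


def missing_flavors(meta, required=("python_function",)):
    """Return the required flavor names that the metadata does not declare.

    Hypothetical helper for illustration, not part of MLflow.
    """
    declared = meta.get("flavors") or {}
    return [name for name in required if name not in declared]


# Report, per model, which of the required flavors are absent.
for meta in (lr_meta, tfidf_meta):
    print(meta["artifact_path"], "is missing:", missing_flavors(meta))
```

For the two models here, this reports nothing missing for lr-model and python_function missing for tfidf-model, which would be consistent with the KeyError raised when serving.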
Questions:
- Why is the tf-idf model file structured differently, i.e. why does it lack python_function?
- Is it possible to edit these model files manually so that I can add the python_function key?
Thanks a lot for your help in advance,
best,
matebreeze