I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):
Using this data, I have built the following model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold
model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(
    df2['descrp_clean'], df2['group_names'],
    random_state=0, test_size=0.25, stratify=df2['group_names'])
# For each record, calculate tf-idf
#######################################################
tfidf = TfidfVectorizer(min_df=3,ngram_range=(1,3))
# X_train: (1) compute tf-idf and (2) reduce dimensionality
#######################################################
x_train_tfidf = tfidf.fit_transform(X_train)
VT_reduce=VarianceThreshold(threshold=0.000005)
x_train_tfidf_reduced=VT_reduce.fit_transform(x_train_tfidf)
# Estimate Naive Bayes model
#######################################################
clf = model.fit(x_train_tfidf_reduced, y_train)
# X_test: Apply the fitted tf-idf and variance threshold
#######################################################
x_test_tfidf=tfidf.transform(X_test)
x_test_tfidf_reduced= VT_reduce.transform(x_test_tfidf)
# Predict using model
######################################################
y_pred = model.predict(x_test_tfidf_reduced)
# Compare actual to predicted results
######################################################
model.score(x_test_tfidf_reduced,y_test)*100
I can create a dataframe showing the word tokens before applying the variance threshold:
X_train_tokens=tfidf.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
x_train_df=pd.DataFrame(X_train_tokens)
x_train_df.tail(5)
After applying the variance threshold, the number of features is reduced to 21,758:
Question: How do I create a dataframe like x_train_df that lists the 21,758 features remaining after the variance threshold is applied?
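One direction that may work, as far as I can tell from the scikit-learn docs, is to use the boolean mask returned by the selector's get_support() method to filter the vectorizer's feature names. The sketch below uses a small made-up corpus in place of df2['descrp_clean'] and a threshold of 0.0; the names docs, vt, and kept are illustrative only and not part of my actual code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy corpus standing in for df2['descrp_clean']
docs = [
    "red apple pie",
    "green apple tart",
    "red cherry pie",
    "green pear tart",
]

tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(docs)

# Threshold chosen only for the toy data; my real code uses 0.000005
vt = VarianceThreshold(threshold=0.0)
x_reduced = vt.fit_transform(x_tfidf)

# Feature names in column order (method name depends on the scikit-learn version)
if hasattr(tfidf, "get_feature_names_out"):
    names = list(tfidf.get_feature_names_out())
else:
    names = list(tfidf.get_feature_names())

# get_support() returns one boolean per original column: True if the column was kept
mask = vt.get_support()
kept = [name for name, keep in zip(names, mask) if keep]

# One row per surviving feature, in the same order as the columns of x_reduced
x_reduced_df = pd.DataFrame(kept)
print(x_reduced_df.tail(5))
```

If this is right, replacing vt with VT_reduce and using my fitted tfidf should give a dataframe with one row per retained feature, and its length should match x_train_tfidf_reduced.shape[1].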