Databricks Community

Jack · ‎09-14-2021

I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):

Using this data, I have built the following model:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold
 
model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(df2['descrp_clean'], df2['group_names'], random_state = 0, test_size=0.25, stratify=df2['group_names'])
 
# For each record, calculate tf-idf 
######################################################################################################################################################
tfidf = TfidfVectorizer(min_df=3,ngram_range=(1,3))  
 
# X_train: (1) compute tf-idf and (2) reduce dimensionality
#######################################################
x_train_tfidf = tfidf.fit_transform(X_train)      
VT_reduce=VarianceThreshold(threshold=0.000005)     
x_train_tfidf_reduced=VT_reduce.fit_transform(x_train_tfidf)   
 
# Estimate Naive Bayes model 
#######################################################
clf = model.fit(x_train_tfidf_reduced, y_train)
 
# X_test: Apply Variance Threshold 
#######################################################
x_test_tfidf=tfidf.transform(X_test)    
x_test_tfidf_reduced= VT_reduce.transform(x_test_tfidf)     
 
# Predict using model
######################################################
y_pred = model.predict(x_test_tfidf_reduced)
 
# Compare actual to predicted results
######################################################
model.score(x_test_tfidf_reduced,y_test)*100

I can create a dataframe showing the word tokens before applying the variance threshold:

# scikit-learn >= 1.0; on older versions use tfidf.get_feature_names()
X_train_tokens=tfidf.get_feature_names_out()
x_train_df=pd.DataFrame(X_train_tokens)
x_train_df.tail(5)

After variance reduction, the number of features is reduced to 21,758:

Question: How do I create a dataframe like x_train_df that lists the 21,758 features remaining after variance reduction?

Dan_Z · ‎09-14-2021

This is more of a scikit-learn question than a Databricks question. But poking around, I think VT_reduce.get_support() is probably what you are looking for. It returns a boolean mask marking which of the input features the selector kept:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#s...
