
Creating Pandas Data Frame of Features After Applying Variance Reduction

Jack
New Contributor II

I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):

[Image: df]

Using this data, I have built the following model:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold
 
model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(df2['descrp_clean'], df2['group_names'], random_state = 0, test_size=0.25, stratify=df2['group_names'])
 
# For each record, calculate tf-idf 
######################################################################################################################################################
tfidf = TfidfVectorizer(min_df=3,ngram_range=(1,3))  
 
# X_train: (1) compute tf-idf and (2) reduce dimensionality
#######################################################
x_train_tfidf = tfidf.fit_transform(X_train)      
VT_reduce=VarianceThreshold(threshold=0.000005)     
x_train_tfidf_reduced=VT_reduce.fit_transform(x_train_tfidf)   
 
# Estimate Naive Bayes model 
#######################################################
clf = model.fit(x_train_tfidf_reduced, y_train)
 
# X_test: Apply Variance Threshold 
#######################################################
x_test_tfidf=tfidf.transform(X_test)    
x_test_tfidf_reduced= VT_reduce.transform(x_test_tfidf)     
 
# Predict using model
######################################################
y_pred = model.predict(x_test_tfidf_reduced)
 
# Compare actual to predicted results
######################################################
model.score(x_test_tfidf_reduced,y_test)*100

I can create a dataframe showing word tokens before applying variance threshold:

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_tokens = tfidf.get_feature_names_out()
x_train_df = pd.DataFrame(X_train_tokens)
x_train_df.tail(5)

After variance reduction, the features are reduced to 21,758:

[Image: df3]

Question: How do I create a dataframe like x_train_df that shows the 21,758 features remaining after variance reduction?

1 ACCEPTED SOLUTION

Dan_Z
Honored Contributor

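This is more of a scikit-learn question than a Databricks question, but poking around, I think VT_reduce.get_support() is probably what you are looking for. It returns a boolean mask over the input features that marks which ones the selector kept, so you can use it to index the tf-idf feature names.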

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#s...

