
Creating Pandas Data Frame of Features After Applying Variance Reduction

Jack
New Contributor II

I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):

[Image: df]

Using this data, I have built the following model:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold         
 
model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(df2['descrp_clean'], df2['group_names'], random_state = 0, test_size=0.25, stratify=df2['group_names'])
 
# For each record, calculate tf-idf 
######################################################################################################################################################
tfidf = TfidfVectorizer(min_df=3,ngram_range=(1,3))  
 
# X_train: Get (1) tfidf and (2) reduce dimensionality
#######################################################
x_train_tfidf = tfidf.fit_transform(X_train)      
VT_reduce=VarianceThreshold(threshold=0.000005)     
x_train_tfidf_reduced=VT_reduce.fit_transform(x_train_tfidf)   
 
# Estimate Naive Bayes model 
#######################################################
clf = model.fit(x_train_tfidf_reduced, y_train)
 
# X_test: Apply Variance Threshold 
#######################################################
x_test_tfidf=tfidf.transform(X_test)    
x_test_tfidf_reduced= VT_reduce.transform(x_test_tfidf)     
 
# Predict using model
######################################################
y_pred = model.predict(x_test_tfidf_reduced)
 
# Compare actual to predicted results
######################################################
model.score(x_test_tfidf_reduced,y_test)*100

I can create a dataframe showing word tokens before applying variance threshold:

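import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+
X_train_tokens=tfidf.get_feature_names_out()
x_train_df=pd.DataFrame(X_train_tokens)
x_train_df.tail(5)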

After variance reduction, the features are reduced to 21,758:

[Image: df3]

Question: How do I create a dataframe like x_train_df that shows the 21,758 features remaining after variance reduction?

Accepted Solution

Dan_Z
Honored Contributor

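This is more of a scikit-learn question than a Databricks question. But poking around, I think VT_reduce.get_support() is probably what you are looking for. It returns a boolean mask with one entry per input feature, True for each feature that survived the variance threshold, so it can be used to index the token list from the vectorizer: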

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#s...
