cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Creating Pandas Data Frame of Features After Applying Variance Reduction

Jack
New Contributor II

I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):

dfUsing this data, I have built the following model:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold         
 
model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(df2['descrp_clean'], df2['group_names'], random_state = 0, test_size=0.25, stratify=df2['group_names'])
 
# For each record, calculate tf-idf 
######################################################################################################################################################
tfidf = TfidfVectorizer(min_df=3,ngram_range=(1,3))  
 
# X_Train: Get (1) tfidf and (2) reduce dimentionality
#######################################################
x_train_tfidf = tfidf.fit_transform(X_train)      
VT_reduce=VarianceThreshold(threshold=0.000005)     
x_train_tfidf_reduced=VT_reduce.fit_transform(x_train_tfidf)   
 
# Estimate Naive Bayes model 
#######################################################
clf = model.fit(x_train_tfidf_reduced, y_train)
 
# X_test: Apply Variance Threshold 
#######################################################
x_test_tfidf=tfidf.transform(X_test)    
x_test_tfidf_reduced= VT_reduce.transform(x_test_tfidf)     
 
# Predict using model
######################################################
y_pred = model.predict(x_test_tfidf_reduced)
 
# Compare actual to predicted results
######################################################
model.score(x_test_tfidf_reduced,y_test)*100

I can create a dataframe showing word tokens before applying variance threshold:

X_train_tokens=tfidf.get_feature_names()
x_train_df=pd.DataFrame(X_train_tokens)
x_train_df.tail(5)

After variance reduction features are reduced to 21,758:

df3Question: How do I create a dataframe like x_train_df of my features after applying variance reduction that will show my 21,758 features?

1 ACCEPTED SOLUTION

Accepted Solutions

Dan_Z
Databricks Employee
Databricks Employee

This is more of a scikit-learn question than a Databricks question. But poking around I think VT_reduced.get_support() is probably what you are looking for:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#s...

View solution in original post

1 REPLY 1

Dan_Z
Databricks Employee
Databricks Employee

This is more of a scikit-learn question than a Databricks question. But poking around I think VT_reduced.get_support() is probably what you are looking for:

https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#s...

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group