<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Creating Pandas Data Frame of Features After Applying Variance Reduction in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/creating-pandas-data-frame-of-features-after-applying-variance/m-p/15371#M9703</link>
    <description>&lt;P&gt;This is more of a scikit-learn question than a Databricks question, but from poking around I think VT_reduce.get_support() is probably what you are looking for; it returns a boolean mask of the features that were kept, which you can use to select the matching names from tfidf:&lt;/P&gt;&lt;P&gt;&lt;A href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support" target="test_blank"&gt;https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 14 Sep 2021 18:28:14 GMT</pubDate>
    <dc:creator>Dan_Z</dc:creator>
    <dc:date>2021-09-14T18:28:14Z</dc:date>
    <item>
      <title>Creating Pandas Data Frame of Features After Applying Variance Reduction</title>
      <link>https://community.databricks.com/t5/data-engineering/creating-pandas-data-frame-of-features-after-applying-variance/m-p/15370#M9702</link>
      <description>&lt;P&gt;I am building a classification model using the following data frame of 120,000 records (sample of 5 records shown):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="df"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2422i62441FB66FABDB43/image-size/large?v=v2&amp;amp;px=999" role="button" title="df" alt="df" /&gt;&lt;/span&gt;Using this data, I have built the following model:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_selection import VarianceThreshold

model = MultinomialNB()
X_train, X_test, y_train, y_test = train_test_split(df2['descrp_clean'], df2['group_names'], random_state=0, test_size=0.25, stratify=df2['group_names'])

# For each record, calculate tf-idf (1- to 3-grams that appear in at least 3 records)
tfidf = TfidfVectorizer(min_df=3, ngram_range=(1, 3))

# X_train: (1) compute tf-idf and (2) reduce dimensionality
x_train_tfidf = tfidf.fit_transform(X_train)
VT_reduce = VarianceThreshold(threshold=0.000005)
x_train_tfidf_reduced = VT_reduce.fit_transform(x_train_tfidf)

# Estimate the Naive Bayes model
clf = model.fit(x_train_tfidf_reduced, y_train)

# X_test: apply the fitted tf-idf vocabulary, then the fitted variance threshold
x_test_tfidf = tfidf.transform(X_test)
x_test_tfidf_reduced = VT_reduce.transform(x_test_tfidf)

# Predict using the model
y_pred = model.predict(x_test_tfidf_reduced)

# Compare actual to predicted results (accuracy, in percent)
model.score(x_test_tfidf_reduced, y_test) * 100&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I can create a dataframe showing the word tokens before applying the variance threshold:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as pd

# get_feature_names_out() needs scikit-learn 1.0 or later; older releases use get_feature_names()
X_train_tokens = tfidf.get_feature_names_out()
x_train_df = pd.DataFrame(X_train_tokens)
x_train_df.tail(5)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;After variance reduction, the features are reduced to 21,758:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="df3"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2429i43C88D1264C04D9D/image-size/large?v=v2&amp;amp;px=999" role="button" title="df3" alt="df3" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Question&lt;/B&gt;: How do I create a dataframe like x_train_df that shows my 21,758 features &lt;B&gt;after&lt;/B&gt; applying variance reduction?&lt;/P&gt;</description>
      <pubDate>Tue, 14 Sep 2021 18:07:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/creating-pandas-data-frame-of-features-after-applying-variance/m-p/15370#M9702</guid>
      <dc:creator>Jack</dc:creator>
      <dc:date>2021-09-14T18:07:01Z</dc:date>
    </item>
    <item>
      <title>Re: Creating Pandas Data Frame of Features After Applying Variance Reduction</title>
      <link>https://community.databricks.com/t5/data-engineering/creating-pandas-data-frame-of-features-after-applying-variance/m-p/15371#M9703</link>
      <description>&lt;P&gt;This is more of a scikit-learn question than a Databricks question, but from poking around I think VT_reduce.get_support() is probably what you are looking for; it returns a boolean mask of the features that were kept, which you can use to select the matching names from tfidf:&lt;/P&gt;&lt;P&gt;&lt;A href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support" target="test_blank"&gt;https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold.get_support&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Sep 2021 18:28:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/creating-pandas-data-frame-of-features-after-applying-variance/m-p/15371#M9703</guid>
      <dc:creator>Dan_Z</dc:creator>
      <dc:date>2021-09-14T18:28:14Z</dc:date>
    </item>
  </channel>
</rss>

