python - scikit-learn SelectPercentile TF-IDF data feature reduction


I am using various mechanisms in scikit-learn to create a TF-IDF representation of a training data set and a test set consisting of text features. Both data sets are preprocessed to use the same vocabulary, so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering: if I use SelectPercentile to reduce the number of features in the training set after transformation, how can I identify the same features in the test set to utilise in prediction?

    traindensedata = traintransformeddata.toarray()
    testdensedata = testtransformeddata.toarray()

    if usefeaturereduction:
        reducedtraindata = SelectPercentile(f_regression, percentile=10).fit_transform(traindensedata, trainyarray)

    clf.fit(reducedtraindata, trainyarray)

    # apply feature reduction to the test data

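See the code and comments below. The idea is to fit the selector on the training data only, then use its support mask to pick the same columns from the test data.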

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn import feature_selection

    # build a classification task using 3 informative features
    x, y = make_classification(n_samples=1000,
                               n_features=10,
                               n_informative=3,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               random_state=0,
                               shuffle=False)

    sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
    # here, training uses all but the last data vector; the last one is the test set
    sp.fit_transform(x[:-1], y[:-1])
    idx = np.arange(0, x.shape[1])  # create an index array
    features_to_keep = idx[sp.get_support()]  # get the index positions of the kept features

    x_fs = x[:, features_to_keep]  # prune the x data vectors
    x_test_fs = x_fs[-1]  # take the last data vector (the test set) with pruned values
    print(x_test_fs)  # these are the pruned test set values
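An alternative to indexing columns by hand is to keep the fitted selector object and call its transform method on the test matrix, which applies the same column mask. The following is only a minimal sketch under stated assumptions: the toy documents and labels are made up, chi2 is swapped in as the scoring function because it suits non-negative TF-IDF counts with class labels (the question uses f_regression), and LogisticRegression stands in for the unspecified clf. It also works directly on the sparse matrices, so no toarray() call is needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and labels, for illustration only.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
    "limited offer buy pills",
    "notes from the project review",
]
train_y = [1, 1, 0, 0, 1, 0]
test_docs = ["buy cheap pills", "monday project meeting"]

# Learn the vocabulary on the training text and reuse it for the test text.
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_docs)
test_tfidf = vectorizer.transform(test_docs)

# Fit the selector on the training data only (chi2 is an assumption here).
selector = SelectPercentile(chi2, percentile=50)
reduced_train = selector.fit_transform(train_tfidf, train_y)

# transform() applies the mask learned above, so the test set keeps
# exactly the same columns as the reduced training set.
reduced_test = selector.transform(test_tfidf)

# Stand-in classifier; any estimator with fit/predict would do.
clf = LogisticRegression()
clf.fit(reduced_train, train_y)
predictions = clf.predict(reduced_test)

# Map the kept column indices back to vocabulary terms for inspection.
support = selector.get_support()
kept_terms = sorted(term for term, col in vectorizer.vocabulary_.items() if support[col])
print(kept_terms)
print(predictions)
```

Wrapping the vectorizer, selector and classifier in a sklearn.pipeline.Pipeline would give the same behaviour with a single fit and predict call, and avoids accidentally fitting the selector on test data.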
