python - scikit-learn SelectPercentile TF-IDF data feature reduction
I am using various mechanisms in scikit-learn to create a tf-idf representation of a training data set and a test set, both consisting of text features. The two data sets are preprocessed to use the same vocabulary, so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering: if I use SelectPercentile to reduce the number of features in the training set after the transformation, how can I identify the same features in the test set to utilise in prediction?
from sklearn.feature_selection import SelectPercentile, f_regression

traindensedata = traintransformeddata.toarray()
testdensedata = testtransformeddata.toarray()

if usefeaturereduction == True:
    reducedtraindata = SelectPercentile(f_regression, percentile=10).fit_transform(traindensedata, trainyarray)
    clf.fit(reducedtraindata, trainyarray)
    # apply feature reduction to test data -- how?
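For reference, the usual scikit-learn pattern here is to keep the fitted selector object and let it reduce the test matrix as well. The sketch below illustrates that pattern under stated assumptions: the toy corpora stand in for the question's text features, LinearSVC is a hypothetical choice for clf, and percentile=50 is arbitrary.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.svm import LinearSVC  # hypothetical choice of classifier

# toy corpora standing in for the question's text features
train_texts = ["the cat sat", "dogs bark loudly", "cats purr", "the dog ran"]
trainyarray = np.array([0, 1, 0, 1])
test_texts = ["a cat purrs", "the dog barks"]

# fitting the vectorizer on the training set and only transforming the
# test set guarantees both matrices share the same vocabulary
vectorizer = TfidfVectorizer()
traindensedata = vectorizer.fit_transform(train_texts).toarray()
testdensedata = vectorizer.transform(test_texts).toarray()

# keep the fitted selector instead of discarding it after fit_transform
sp = SelectPercentile(f_regression, percentile=50)
reducedtraindata = sp.fit_transform(traindensedata, trainyarray)

clf = LinearSVC()
clf.fit(reducedtraindata, trainyarray)

# the fitted selector remembers which columns it kept, so it can apply
# the identical reduction to the test matrix
reducedtestdata = sp.transform(testdensedata)
print(clf.predict(reducedtestdata))

Fitting the vectorizer only on the training texts is what keeps the two matrices on the same vocabulary, and reusing the fitted selector is what keeps them on the same reduced feature set.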
See the code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection

# build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           random_state=0, shuffle=False)

sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1])  # train on all but the last data vector; the last one is the test set

idx = np.arange(0, X.shape[1])            # create an index array
features_to_keep = idx[sp.get_support()]  # get the index positions of the kept features

X_fs = X[:, features_to_keep]  # prune the X data vectors
X_test_fs = X_fs[-1]           # take the last data vector (the test set) with its pruned values
print(X_test_fs)               # these are the pruned test set values
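As a side note, since sp has been fitted, the same pruned values can be obtained without building the index array by hand; get_support() is simply the boolean mask that the selector's own transform applies internally:

# equivalent, using the fitted selector directly
# (transform expects a 2-D array, hence X[-1:] rather than X[-1];
# the result comes back as a 1-row matrix of the same pruned values)
X_test_fs = sp.transform(X[-1:])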