Python - sklearn: remove low-information features
I am trying to do text classification and want to eliminate low-information features. I have used the code from this site:
Text Classification for Sentiment Analysis – Eliminate Low Information Features
My adaptation:
    import operator
    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import FreqDist, ConditionalFreqDist

    def calculatelowinformationfeatures(self, korpus, lowerbound, stopwords):
        word_fd = FreqDist()
        label_word_fd = ConditionalFreqDist()
        # Also treat the capitalized form of every stop word as a stop word.
        stoppwords_up = [stopword[:1].upper() + stopword[1:] for stopword in stopwords]
        all_stopwords = set(stopwords + stoppwords_up)

        # Count word frequencies overall and per label.
        for word in korpus.words(categories=['pos']):
            word = self.checkifisprep(word)
            if self.wordpassescheck(word, all_stopwords):
                word_fd[word] += 1
                label_word_fd['pos'][word] += 1

        for word in korpus.words(categories=['neg']):
            word = self.checkifisprep(word)
            if self.wordpassescheck(word, all_stopwords):
                word_fd[word] += 1
                label_word_fd['neg'][word] += 1

        pos_word_count = label_word_fd['pos'].N()
        neg_word_count = label_word_fd['neg'].N()
        total_word_count = pos_word_count + neg_word_count

        # Score each word by summing its chi-squared statistic for both labels.
        word_scores = {}
        for word, freq in word_fd.items():
            pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                                   (freq, pos_word_count), total_word_count)
            neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                                   (freq, neg_word_count), total_word_count)
            word_scores[word] = pos_score + neg_score

        # Keep the lowerbound highest-scoring words.
        overall = sorted(word_scores.items(), key=operator.itemgetter(1), reverse=True)
        best = overall[:lowerbound]
        bestwords = set([w for w, s in best])
        return bestwords

Now I'm rewriting the code using sklearn and no NLTK.
How do I implement this using sklearn?
I have looked here: Feature selection, but a) it's a little over my head and b) I don't know how applicable it is to text data.
This is how I process my corpus:
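    import numpy as np
    from sklearn.cross_validation import KFold  # legacy module, removed in scikit-learn 0.20
    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    corpus = load_files('corpus')

    # Use every stop word in both its original and its title-cased form.
    with open('stopwords.txt', 'r') as f:
        stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

    k_fold = KFold(n=len(corpus.data), n_folds=6)

    pipeline = Pipeline([
        ('vec', CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))),
        ('cl1', MultinomialNB())])

    corpusdatanp = np.array(corpus.data)
    corpustargetnp = np.array(corpus.target)

    for train_indices, test_indices in k_fold:
        pipeline.fit(corpusdatanp[train_indices], corpustargetnp[train_indices])
        predictions = pipeline.predict(corpusdatanp[test_indices])

How do I use, e.g., SelectKBest here (if it's a good idea at all)?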
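One possible way to approach this, sketched under assumptions rather than given as a definitive answer: scikit-learn's `SelectKBest` with the `chi2` score function accepts the sparse count matrix that `CountVectorizer` produces, so it can be placed as a pipeline step between the vectorizer and the classifier. The toy corpus, the step name `'sel'`, and `k=4` below are illustrative only and are not from the original post. In practice `k` would play the role of `lowerbound` and would need tuning.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for load_files('corpus').
docs = np.array([
    "great movie with a wonderful cast",
    "wonderful story and great acting",
    "a great and touching film",
    "terrible movie with a boring plot",
    "boring story and terrible acting",
    "a terrible and dull film",
])
labels = np.array([1, 1, 1, 0, 0, 0])  # 1 = pos, 0 = neg

pipeline = Pipeline([
    ('vec', CountVectorizer(ngram_range=(1, 2))),
    # Keep only the k n-grams with the highest chi-squared score
    # against the labels; k=4 is an arbitrary choice for this toy data.
    ('sel', SelectKBest(chi2, k=4)),
    ('cl1', MultinomialNB()),
])

pipeline.fit(docs, labels)

# Map the selector's boolean mask back to the n-gram strings.
vec = pipeline.named_steps['vec']
mask = pipeline.named_steps['sel'].get_support()
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # terms in column order
kept = [term for term, keep in zip(terms, mask) if keep]

predictions = pipeline.predict(np.array(["a great cast", "a boring plot"]))
```

On this toy data, `kept` contains the four words that occur in only one class and more than once: "boring", "great", "terrible", and "wonderful". Because the selector is a pipeline step, calling `pipeline.fit` inside a cross-validation loop refits it on each training fold only, so the test fold does not influence which features are kept.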