python - sklearn: remove low information features


I am trying to do text classification and want to eliminate low-information features. I have used the code from this site:

Text Classification for Sentiment Analysis – Eliminate Low Information Features

My adaptation:

    import operator

    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import ConditionalFreqDist, FreqDist

    def calculatelowinformationfeatures(self, korpus, lowerbound, stopwords):
        word_fd = FreqDist()
        label_word_fd = ConditionalFreqDist()

        # Also treat the capitalized form of every stopword as a stopword.
        stoppwords_up = [stopword[:1].upper() + stopword[1:] for stopword in stopwords]
        all_stopwords = set(stopwords + stoppwords_up)

        # Count each accepted word overall and per label.
        for word in korpus.words(categories=['pos']):
            word = self.checkifisprep(word)
            if self.wordpassescheck(word, all_stopwords):
                word_fd[word] += 1
                label_word_fd['pos'][word] += 1

        for word in korpus.words(categories=['neg']):
            word = self.checkifisprep(word)
            if self.wordpassescheck(word, all_stopwords):
                word_fd[word] += 1
                label_word_fd['neg'][word] += 1

        pos_word_count = label_word_fd['pos'].N()
        neg_word_count = label_word_fd['neg'].N()
        total_word_count = pos_word_count + neg_word_count

        word_scores = {}

        # Score each word by the sum of its chi-squared association with both labels.
        for word, freq in word_fd.items():
            pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                                   (freq, pos_word_count), total_word_count)
            neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                                   (freq, neg_word_count), total_word_count)
            word_scores[word] = pos_score + neg_score

        # Keep the `lowerbound` highest-scoring words.
        overall = sorted(word_scores.items(), key=operator.itemgetter(1), reverse=True)
        best = overall[:lowerbound]
        bestwords = set([w for w, s in best])

        return bestwords

Now I'm rewriting the code using sklearn and no NLTK.

How do I implement this using sklearn?

I have looked here: Feature selection, but it's a little over my head, and I don't know how applicable it is to text data.
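On the question of applicability: sklearn's `chi2` scorer accepts non-negative features such as term counts, including the sparse matrix that `CountVectorizer` produces. The following is only a minimal sketch on a made-up toy corpus (the documents and labels are hypothetical, not from the question) showing how per-word chi-squared scores could be computed and ranked, which is roughly what the NLTK code above does by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Hypothetical toy data: 1 = positive, 0 = negative.
docs = [
    "great film loved it", "wonderful acting great plot",
    "loved the wonderful cast", "great fun loved it",
    "terrible film hated it", "awful acting boring plot",
    "hated the boring cast", "awful mess hated it",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse document-term count matrix

# One chi-squared score (and p-value) per vocabulary term.
scores, p_values = chi2(X, labels)

# vocabulary_ maps term -> column index; order the terms by column.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Highest-scoring (most label-dependent) terms first.
ranked = sorted(zip(terms, scores), key=lambda ts: ts[1], reverse=True)
for term, score in ranked[:5]:
    print(term, round(score, 2))
```

Words that appear equally often in both classes (such as "it" or "film" in this toy data) score 0 and would be the first candidates for removal.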

This is how I process the corpus:

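    import numpy as np
    # Legacy module: matches the KFold(n=..., n_folds=...) signature used below (pre-0.18 API).
    from sklearn.cross_validation import KFold
    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    corpus = load_files('corpus')

    # Use each stopword in its original and its title-cased form.
    with open('stopwords.txt', 'r') as f:
        stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

    k_fold = KFold(n=len(corpus.data), n_folds=6)

    pipeline = Pipeline([
        ('vec', CountVectorizer(stop_words=stop_words, ngram_range=(1, 2))),
        ('cl1', MultinomialNB())])

    corpusdatanp = np.array(corpus.data)
    corpustargetnp = np.array(corpus.target)

    # Fit on the training fold and predict on the held-out fold.
    for train_indices, test_indices in k_fold:
        pipeline.fit(corpusdatanp[train_indices], corpustargetnp[train_indices])
        predictions = pipeline.predict(corpusdatanp[test_indices])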

How do I use, e.g., SelectKBest here (if it's a good idea at all)?
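One possible way to do this, sketched under assumptions: `SelectKBest` with the `chi2` score function can sit as a step between the vectorizer and the classifier, so that selection is re-fitted on each training fold only. The step name `'sel'`, the value `k=10`, and the toy documents below are illustrative choices, not part of the original code; with a real corpus, `k` would need tuning.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for corpus.data / corpus.target.
docs = np.array([
    "great film loved it", "wonderful acting great plot",
    "loved the wonderful cast", "great fun loved it",
    "terrible film hated it", "awful acting boring plot",
    "hated the boring cast", "awful mess hated it",
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

pipeline = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    # Keep only the k features with the highest chi-squared score.
    ("sel", SelectKBest(chi2, k=10)),
    ("cl1", MultinomialNB()),
])

pipeline.fit(docs, labels)
predictions = pipeline.predict(
    np.array(["loved the great plot", "boring awful film"]))

# Recover the names of the features that survived selection.
vec = pipeline.named_steps["vec"]
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
kept = terms[pipeline.named_steps["sel"].get_support()]
print(sorted(kept))
```

Because the selector is part of the pipeline, calling `pipeline.fit` inside the cross-validation loop computes the scores from the training indices alone, which avoids leaking information from the test fold into the feature selection.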

