python - What input do the vocabulary and tokenizer parameters of TfidfVectorizer expect?
What is the proper input type for the tokenizer and vocabulary parameters of TfidfVectorizer?
First, I tried a custom tokenizer with my own tokenizing function, because I need to stem terms in my own language. The code is below:
    def tokenize(documents):
        final_tokens = []
        for doc in documents:
            tokens = doc.split()
            tokens = [token.lower() for token in tokens if len(token) > 2]
            tokens = [stemmer.stem(token) for token in tokens]
            final_tokens.extend(tokens)
        return final_tokens

    vectorizer = TfidfVectorizer(stop_words=stops, min_df=0.1, tokenizer=tokenize)
    features = vectorizer.fit_transform(titles)
However, when I try to print features, I get this error:
    ValueError: empty vocabulary; perhaps the documents only contain stop words
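A likely cause of the empty vocabulary, offered as a sketch rather than a confirmed diagnosis: scikit-learn calls the tokenizer once per document and passes it a single string, not the whole collection. Looping over that string yields single characters, and the len(token) > 2 filter then discards every one of them. The version below takes one document and returns its list of tokens. PlaceholderStemmer and build_vectorizer are hypothetical names introduced only for illustration; the real language-specific stemmer from the question would replace the placeholder.

```python
class PlaceholderStemmer:
    """Hypothetical stand-in for the language-specific stemmer in the question."""

    def stem(self, token):
        # A real stemmer would reduce the token to its root form.
        return token


stemmer = PlaceholderStemmer()


def tokenize(document):
    # scikit-learn passes ONE document (a single string) per call,
    # so split that string directly instead of looping over it.
    tokens = [token.lower() for token in document.split() if len(token) > 2]
    return [stemmer.stem(token) for token in tokens]


def build_vectorizer(stop_words=None):
    # Imported lazily so tokenize() can be tried without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer(tokenizer=tokenize, stop_words=stop_words, min_df=0.1)
```

If the vocabulary is still empty after this change, the stop word list and the min_df threshold are the next things to check, since both can prune every remaining term.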
Second, I have a list of terms and their document frequencies, taken from CSV files that contain the result of tokenizing the dataset. The code is below:
    vocabs = [row[1] for row in list(csv.reader(dataset_tokenize))]
    vectorizer = TfidfVectorizer(stop_words=stops, vocabulary=vocabs)
    features = vectorizer.fit_transform(titles)
Now I get a new error:
    ValueError: Duplicate term in vocabulary: 'term'
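The vocabulary parameter accepts either an iterable of terms or a mapping from term to column index, and in both cases every term must appear only once. A minimal sketch of one way to satisfy that, assuming the CSV column simply repeats some terms: remove the duplicates before passing the list in. unique_terms and build_vectorizer are hypothetical helper names, not part of scikit-learn.

```python
def unique_terms(terms):
    # dict.fromkeys keeps the first occurrence of each term and preserves order.
    return list(dict.fromkeys(terms))


def build_vectorizer(terms, stop_words=None):
    # Imported lazily so unique_terms() can be tried without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer(vocabulary=unique_terms(terms), stop_words=stop_words)
```

With the list from the question, this would be called as build_vectorizer(vocabs, stop_words=stops). Keeping the original order means the column indices of the resulting matrix follow the order in which terms first appear in the CSV file.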
Can anyone tell me the right input for these parameters? Any answer is appreciated :)