python - What input for the parameters vocabulary and tokenizer of TfidfVectorizer?


What is the proper input type for the tokenizer and vocabulary parameters of TfidfVectorizer?

First, I tried a custom tokenizer with my own tokenizing function, because I need to stem terms in my own language. Code below:

def tokenize(documents):
    final_tokens = []
    for doc in documents:
        tokens = doc.split()
        tokens = [token.lower() for token in tokens if len(token) > 2]
        tokens = [stemmer.stem(token) for token in tokens]
        final_tokens.extend(tokens)
    return final_tokens

vectorizer = TfidfVectorizer(stop_words=stops, min_df=0.1, tokenizer=tokenize)
features = vectorizer.fit_transform(titles)

However, when I want to print the features, I got an error like this:

ValueError: empty vocabulary; perhaps the documents only contain stop words
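As a side note on what I think is happening: TfidfVectorizer calls the tokenizer once per document, so inside tokenize the argument is a single string rather than the whole list of titles; iterating over it yields individual characters, and the len(token) > 2 filter then discards them all, which would explain the empty vocabulary. A per-document version might look like this (a minimal sketch, reusing the stemmer, stops and titles names from above):

from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(document):
    # TfidfVectorizer passes one document (a single string) at a time
    tokens = document.split()
    tokens = [token.lower() for token in tokens if len(token) > 2]
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(stop_words=stops, min_df=0.1, tokenizer=tokenize)
features = vectorizer.fit_transform(titles)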

Second, I have a list of terms and their document frequencies, taken from CSV files that contain the result of tokenizing the dataset. Code below:

vocabs = [row[1] for row in list(csv.reader(dataset_tokenize))]

vectorizer = TfidfVectorizer(stop_words=stops, vocabulary=vocabs)
features = vectorizer.fit_transform(titles)

Now I got a new error:

ValueError: Duplicate term in vocabulary: 'term'
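If I read the error correctly, the vocabulary parameter expects every term to appear exactly once, and my CSV apparently lists some terms on more than one row. A minimal sketch that drops the duplicates while keeping the first-seen order (assuming, as above, that dataset_tokenize is the open CSV file and column 1 holds the term):

import csv
from sklearn.feature_extraction.text import TfidfVectorizer

# keep only the first occurrence of each term; dict preserves insertion order
vocabs = list(dict.fromkeys(row[1] for row in csv.reader(dataset_tokenize)))

vectorizer = TfidfVectorizer(stop_words=stops, vocabulary=vocabs)
features = vectorizer.fit_transform(titles)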

Is there anyone who can help me with the right input parameters? Any answer is appreciated :)

