python - What input for the parameters vocabulary and tokenizer of TfidfVectorizer?


What is the proper input type for the tokenizer and vocabulary parameters of TfidfVectorizer?

First, I tried a custom tokenizer with my own tokenizing function, because I need to stem terms in my own language. Code below:

def tokenize(documents):
    final_tokens = []
    for doc in documents:
        tokens = doc.split()
        tokens = [token.lower() for token in tokens if len(token) > 2]
        tokens = [stemmer.stem(token) for token in tokens]
        final_tokens.extend(tokens)
    return final_tokens

vectorizer = TfidfVectorizer(stop_words=stops, min_df=0.1, tokenizer=tokenize)
features = vectorizer.fit_transform(titles)

However, when I want to print the features, I got an error like this:

ValueError: empty vocabulary; perhaps the documents only contain stop words
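As a side note on what I think is happening: TfidfVectorizer calls the tokenizer once per document, so inside tokenize the argument is a single string rather than the whole list of titles; iterating over it yields individual characters, and the len(token) > 2 filter then discards them all, which would explain the empty vocabulary. A per-document version might look like this (a minimal sketch, reusing the stemmer, stops and titles names from above):

from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(document):
    # TfidfVectorizer passes one document (a single string) at a time
    tokens = document.split()
    tokens = [token.lower() for token in tokens if len(token) > 2]
    return [stemmer.stem(token) for token in tokens]

vectorizer = TfidfVectorizer(stop_words=stops, min_df=0.1, tokenizer=tokenize)
features = vectorizer.fit_transform(titles)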

Second, I have a list of terms and their document frequencies, taken from CSV files that contain the result of tokenizing the dataset. Code below:

vocabs = [row[1] for row in list(csv.reader(dataset_tokenize))]

vectorizer = TfidfVectorizer(stop_words=stops, vocabulary=vocabs)
features = vectorizer.fit_transform(titles)

Now I got a new error:

ValueError: Duplicate term in vocabulary: 'term'
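If I read the error correctly, the vocabulary parameter expects every term to appear exactly once, and my CSV apparently lists some terms on more than one row. A minimal sketch that drops the duplicates while keeping the first-seen order (assuming, as above, that dataset_tokenize is the open CSV file and column 1 holds the term):

import csv
from sklearn.feature_extraction.text import TfidfVectorizer

# keep only the first occurrence of each term; dict preserves insertion order
vocabs = list(dict.fromkeys(row[1] for row in csv.reader(dataset_tokenize)))

vectorizer = TfidfVectorizer(stop_words=stops, vocabulary=vocabs)
features = vectorizer.fit_transform(titles)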

Is there anyone who can help me with the right input parameters? Any answer is appreciated :)

