How to create a word2vec model with data extracted from a Wikipedia summary in Python
I extracted the summary of the Wikipedia page for "machine learning" and want to use that data to build a word2vec model with the gensim library.
So, first I get the wiki summary of "machine learning" (using the wikipedia Python API):
sentences = wikipedia.summary("machine learning")
and create the model:

model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
The problem is that when I print the vocabulary keys, I get a list of characters rather than a list of words. I use the following code to print the vocabulary keys:
print list(model.vocab.keys())
Where am I going wrong?
Here is the full code:
import wikipedia, gensim.models

sentences = wikipedia.summary("machine learning")
model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
print list(model.vocab.keys())
You are missing the following two things:

- converting the Unicode summary to UTF-8
- using gensim.models.word2vec.LineSentence to make a gensim corpus object

Word2Vec expects an iterable of sentences, where each sentence is a list of word tokens. When you pass it a raw string, iterating over that string yields single characters, which is why your vocabulary contains characters instead of words.
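As a minimal plain-Python sketch of the underlying issue (the example text here is hypothetical), compare what Word2Vec sees when given a raw string versus the list-of-token-lists shape it expects:

```python
summary = "machine learning is great"

# Iterating over a string yields single characters --
# this is what Word2Vec sees when given the raw summary:
print(list(summary)[:7])   # ['m', 'a', 'c', 'h', 'i', 'n', 'e']

# What it should be given instead: an iterable of sentences,
# each sentence being a list of word tokens
sentences = [summary.split()]
print(sentences)           # [['machine', 'learning', 'is', 'great']]
```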
The following is a complete working Python script:
# libraries
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import wikipedia

# word2vec model parameters
min_count = 2
size = 50
window = 4

# getting the "machine learning" summary from Wikipedia
summary = wikipedia.summary("machine learning")

# changing Unicode to UTF-8 and writing the summary to a text file
text = summary.encode("utf-8")
filewriter = open("machine_learning.txt", "w")
filewriter.write(text)
filewriter.close()

# reading the machine_learning.txt file using LineSentence
sentences = LineSentence("machine_learning.txt")

# making the gensim model and training it on the sentences
model = Word2Vec(sentences, min_count=min_count, size=size, window=window)

# printing the model's vocabulary
print(model.vocab.keys())

# printing the vector for the word 'learning'
print(model["learning"])
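If you would rather not write a temporary file, the summary can also be tokenized in memory and passed straight to Word2Vec. This is only a rough sketch (the example text is made up, and a real pipeline would also lowercase and strip punctuation):

```python
# hypothetical stand-in for wikipedia.summary("machine learning")
summary = ("Machine learning is the study of algorithms. "
           "Machine learning builds a model from sample data.")

# split into sentences, then each sentence into word tokens,
# producing the list-of-token-lists shape Word2Vec expects
sentences = [line.split() for line in summary.split(". ") if line]
print(sentences[0][:2])   # ['Machine', 'learning']

# model = Word2Vec(sentences, min_count=min_count, size=size, window=window)
```

Note that in newer gensim releases (4.x), the `size` parameter was renamed `vector_size` and the vocabulary moved to `model.wv.key_to_index`; the script above targets the older API used in the question.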
Hope this helps!