How to create a word2vec model with data extracted from a Wikipedia summary in Python


I want to extract data from the Wikipedia summary page for "machine learning" and use that data to build a word2vec model with the gensim library.

So, first I get the wiki summary of "machine learning" (using the wikipedia API for Python):

sentences = wikipedia.summary("machine learning") 

and then create the model:

model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)

The problem is that, if I print the vocabulary keys, I get a list of characters rather than a list of words. The following code is used to print the vocabulary keys:

print list(model.vocab.keys()) 

Where am I going wrong?

Here is the full code:

import wikipedia, gensim.models
sentences = wikipedia.summary("machine learning")
model = gensim.models.Word2Vec(sentences, min_count=2, size=50, window=4)
print list(model.vocab.keys())

You are missing the following two things:

  1. Converting the text from Unicode to UTF-8
  2. Using gensim.models.word2vec.LineSentence to build a sentence object that gensim can iterate over
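The vocabulary comes out as single characters because Word2Vec expects an iterable of sentences, each one a list of word tokens, and iterating over a plain string yields one character at a time. The following is a minimal sketch of that difference using only the standard library; the sample text and the helper name `to_token_lists` are assumptions made for illustration, not part of gensim or the wikipedia package.

```python
# Hypothetical sample text standing in for wikipedia.summary("machine learning").
text = "machine learning is a field of study\nmachine learning uses data"

# Iterating over a plain string yields single characters, which is why
# the vocabulary came out as letters instead of words.
chars = list(text)[:7]


def to_token_lists(raw_text):
    """Return one list of whitespace-separated tokens per non-empty line."""
    return [line.split() for line in raw_text.splitlines() if line.strip()]


sentences = to_token_lists(text)

print(chars)
print(sentences)
```

A list of token lists like `sentences` has the shape Word2Vec expects, so it could be passed to the model directly instead of going through a file.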

The following is a complete working Python script:

# libraries
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
import wikipedia

# word2vec model parameters
min_count = 2
size = 50
window = 4

# getting the "machine learning" summary from wikipedia
summary = wikipedia.summary("machine learning")

# changing unicode to utf-8 and writing the summary to a text file
text = summary.encode("utf-8")
filewriter = open("machine_learning.txt", "w")
filewriter.write(text)
filewriter.close()

# reading the machine_learning.txt file using LineSentence
sentences = LineSentence("machine_learning.txt")

# making the gensim model and training it on the sentences
model = Word2Vec(sentences, min_count = min_count, size = size, window = window)

# printing the model's vocabulary
print(model.vocab.keys())

# printing the vector for the 'learning' word
print(model["learning"])
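The script above writes encoded bytes to a file opened in text mode, which works on Python 2 but raises a TypeError on Python 3. A possible Python 3 variant of the file-writing step is sketched below, assuming the summary is an ordinary `str`; the sample text and the temporary file location are hypothetical stand-ins so the sketch runs without network access or gensim.

```python
import os
import tempfile

# Hypothetical sample text standing in for the Wikipedia summary.
summary = (
    "Machine learning is the study of algorithms.\n"
    "Naïve models improve with more data."
)

# Hypothetical file location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "machine_learning.txt")

# In Python 3, let open() handle the UTF-8 encoding instead of calling .encode() first.
with open(path, "w", encoding="utf-8") as file_writer:
    file_writer.write(summary)

# Read the file back as one whitespace-tokenized sentence per non-empty line.
with open(path, encoding="utf-8") as file_reader:
    sentences = [line.split() for line in file_reader if line.strip()]

print(sentences)
```

The file written this way is plain UTF-8 text with one sentence per line, which is the layout LineSentence reads.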

Hope this helps!

