python - KMeans: Terms occurring in more than one cluster?


When using KMeans with a TF-IDF vectorizer, is it possible for terms to occur in more than one cluster?

Here is the example dataset:

    documents = ["human machine interface lab abc computer applications",
                 "a survey of user opinion of computer system response time",
                 "the eps user interface management system",
                 "system , human system engineering testing of eps",
                 "relation of user perceived response time error measurement",
                 "the generation of random binary unordered trees",
                 "the intersection graph of paths in trees",
                 "graph minors iv widths of trees , quasi ordering",
                 "graph minors survey"]

I use the TF-IDF vectorizer for feature extraction:

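    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer(stop_words='english')
    feature = vectorizer.fit_transform(documents)
    # terms was not defined in the snippet; here it is taken from the vectorizer
    # (get_feature_names_out() in scikit-learn >= 1.0, get_feature_names() before)
    terms = vectorizer.get_feature_names_out()

    true_k = 3
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    km.fit(feature)

    # for each centroid, term indices sorted by descending weight
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    print("top terms per cluster:")
    for i in range(true_k):
        print("cluster %d:" % i, end=' ')
        for ind in order_centroids[i, :10]:
            print(' %s,' % terms[ind], end=' ')
        print()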

When I cluster the documents using KMeans from scikit-learn, the results are as below:

    top terms per cluster:
    cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
    cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
    cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

We can see that some terms occur in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).

Are the cluster results wrong? Or is this acceptable because the TF-IDF scores of the terms above differ for each document?

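I think you are a bit confused about what you are trying to do. The code you use gives you a clustering of the documents, not of the terms. The terms are the dimensions you are clustering on, so a term is not assigned to any single cluster, and the same term can show up among the top terms of several centroids.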

If you want to find which cluster each document belongs to, you need to use the predict or fit_predict method, like this:

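    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer(stop_words='english')
    feature = vectorizer.fit_transform(documents)
    true_k = 3
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    km.fit(feature)
    for n in range(9):
        # predict returns an array with one label per row, so take element 0
        print("doc %d belongs to cluster %d. " % (n, km.predict(feature[n])[0]))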

And you get:

    doc 0 belongs to cluster 2.
    doc 1 belongs to cluster 1.
    doc 2 belongs to cluster 2.
    doc 3 belongs to cluster 2.
    doc 4 belongs to cluster 1.
    doc 5 belongs to cluster 0.
    doc 6 belongs to cluster 0.
    doc 7 belongs to cluster 0.
    doc 8 belongs to cluster 1.
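The same assignment can also be obtained in one step with fit_predict, which fits the model and returns one label per document. This is only a sketch: random_state=0 is my addition so that repeated runs agree, and the cluster numbers themselves are arbitrary, so they need not match the listing above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["human machine interface lab abc computer applications",
             "a survey of user opinion of computer system response time",
             "the eps user interface management system",
             "system , human system engineering testing of eps",
             "relation of user perceived response time error measurement",
             "the generation of random binary unordered trees",
             "the intersection graph of paths in trees",
             "graph minors iv widths of trees , quasi ordering",
             "graph minors survey"]

feature = TfidfVectorizer(stop_words="english").fit_transform(documents)

# random_state is an assumption added so that repeated runs agree.
km = KMeans(n_clusters=3, init="k-means++", max_iter=100, n_init=1,
            random_state=0)

# fit_predict fits the model and returns one cluster label per document.
labels = km.fit_predict(feature)

for n, label in enumerate(labels):
    print("doc %d belongs to cluster %d." % (n, label))
```

After fitting, the same labels are also available as km.labels_, so there is no need to call predict on the training documents one at a time.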

Take a look at the user guide of scikit-learn.

