python - Kmeans: Terms occurring in more than one cluster?

May 15, 2012

When using KMeans with a TF-IDF vectorizer, is it possible to get terms occurring in more than one cluster?

Here is the example dataset:

documents = ["human machine interface lab abc computer applications",
             "a survey of user opinion of computer system response time",
             "the eps user interface management system",
             "system , human system engineering testing of eps",
             "relation of user perceived response time error measurement",
             "the generation of random binary unordered trees",
             "the intersection graph of paths in trees",
             "graph minors iv widths of trees , quasi ordering",
             "graph minors survey"]

I use the TF-IDF vectorizer for feature extraction:

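from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
# Vocabulary in column order; get_feature_names_out() needs scikit-learn >= 1.0
# (older versions used get_feature_names())
terms = vectorizer.get_feature_names_out()
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
# For each centroid, term indices sorted by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
print("Top terms per cluster:")
for i in range(true_k):
    print("Cluster %d:" % i, end=' ')
    for ind in order_centroids[i, :10]:
        print(' %s,' % terms[ind], end=' ')
    print()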

When I cluster the documents using KMeans from scikit-learn, the results are as below:

Top terms per cluster:
Cluster 0:  user,  eps,  interface,  human,  response,  time,  computer,  management,  engineering,  testing,
Cluster 1:  trees,  intersection,  paths,  random,  generation,  unordered,  binary,  graph,  interface,  human,
Cluster 2:  minors,  graph,  survey,  widths,  ordering,  quasi,  iv,  trees,  engineering,  eps,

We can see that some terms occur in more than one cluster (e.g., graph in clusters 1 and 2, eps in clusters 0 and 2).

Are the cluster results wrong? Or is this acceptable because the TF-IDF scores of the terms above are different for each document?

I think you are a bit confused about what you are trying to do. The code you use gives you a clustering of the documents, not of the terms. The terms are the dimensions you are clustering on.

If you want to find which cluster each document belongs to, you just need to use the predict or fit_predict method, like this:

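from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english')
feature = vectorizer.fit_transform(documents)
true_k = 3
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km.fit(feature)
for n in range(9):
    # predict returns an array with one label per row, so take its only element
    print("Doc %d belongs to cluster %d. " % (n, km.predict(feature[n])[0]))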

And you get:

Doc 0 belongs to cluster 2.
Doc 1 belongs to cluster 1.
Doc 2 belongs to cluster 2.
Doc 3 belongs to cluster 2.
Doc 4 belongs to cluster 1.
Doc 5 belongs to cluster 0.
Doc 6 belongs to cluster 0.
Doc 7 belongs to cluster 0.
Doc 8 belongs to cluster 1.
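The fit_predict route mentioned above can be sketched as follows: it fits the model and returns one label per document in a single call, which avoids calling predict row by row. The `random_state=0` argument and the name `docs_by_cluster` are assumptions added for illustration, so the labels will not necessarily match the listing above.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["human machine interface lab abc computer applications",
             "a survey of user opinion of computer system response time",
             "the eps user interface management system",
             "system , human system engineering testing of eps",
             "relation of user perceived response time error measurement",
             "the generation of random binary unordered trees",
             "the intersection graph of paths in trees",
             "graph minors iv widths of trees , quasi ordering",
             "graph minors survey"]

feature = TfidfVectorizer(stop_words="english").fit_transform(documents)

# random_state is an assumption, added only so repeated runs agree
km = KMeans(n_clusters=3, init="k-means++", max_iter=100, n_init=1,
            random_state=0)

# Fit and label all documents at once; the same labels are stored in km.labels_
labels = km.fit_predict(feature)

# Group document indices by the cluster they were assigned to
docs_by_cluster = defaultdict(list)
for doc_id, label in enumerate(labels):
    docs_by_cluster[int(label)].append(doc_id)

for label in sorted(docs_by_cluster):
    print("Cluster %d: docs %s" % (label, docs_by_cluster[label]))
```

Each document index appears under exactly one cluster here, which is the sense in which k-means assignments are exclusive: documents are partitioned, while terms are shared across centroids.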

Take a look at the user guide of scikit-learn.
