lsa - Document Similarity in R

January 15, 2013

I am new to text mining and semantic analysis. I am trying to find the similarity of documents. As a first approach, I am using jaccard_similarity to find which documents are similar. From what I have read, jaccard_similarity finds similar words, so it is perhaps not the best approach for finding similar documents, but I thought I would give it a try anyway.
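For reference, the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The base R sketch below only illustrates that definition; the `jaccard` helper and the two toy documents are made up for this example, and it compares single words rather than the 5-word n-grams used later in the question.

```r
# Jaccard similarity of two token sets: |intersection| / |union|.
# Hypothetical helper for illustration, not the textreuse implementation.
jaccard <- function(x, y) {
  x <- unique(x)
  y <- unique(y)
  union_size <- length(union(x, y))
  if (union_size == 0) return(0)  # convention chosen here for two empty sets
  length(intersect(x, y)) / union_size
}

# Two toy documents split into words (made-up data).
doc_a <- strsplit("the quick brown fox jumps", " ", fixed = TRUE)[[1]]
doc_b <- strsplit("the quick red fox sleeps", " ", fixed = TRUE)[[1]]

# Shared words: the, quick, fox (3); distinct words overall: 7.
similarity <- jaccard(doc_a, doc_b)
```

Because the score depends only on which tokens overlap, two documents about the same topic that use different wording can still score low.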

The problem I am facing is the following. I have 200 type A documents and 1,000,000 type B documents. I need to see which (if any) of the A documents are similar to B documents. I do not need to see whether type A documents are similar to each other, or whether B documents are similar to each other.

I found the "textreuse" package in R to perform this. I do the following:

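    library(textreuse)

    # Minhash function with 100 hashes and a fixed seed for reproducibility
    minhash <- minhash_generator(100, seed = 235)

    # Build the corpus from word 5-grams and compute minhash signatures
    corpus <- TextReuseCorpus(text = as.character(data$text),
                              tokenizer = tokenize_ngrams, n = 5,
                              minhash_func = minhash)

    # Locality-sensitive hashing: 50 bands over the 100 minhashes
    buckets <- lsh(corpus, bands = 50, progress = FALSE)
    candidates <- lsh_candidates(buckets)

    # Score only the candidate pairs with the Jaccard similarity
    scores <- lsh_compare(candidates, corpus, jaccard_similarity,
                          progress = TRUE)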

where data$text is a subsample of the complete A+B documents. The issue is that there are many documents and, with this approach, I am doing unnecessary comparisons (A1 against A2 documents, for example, or B1 against B2).
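To make the unwanted comparisons concrete, here is a minimal base R sketch of dropping same-type pairs from a candidate table before scoring. It assumes the candidates are a data frame with the two document IDs in columns `a` and `b` (the shape I understand `lsh_candidates` to return), and that the IDs begin with "A" or "B" depending on the document type. The example data and the helpers `is_type_a` and `keep_cross_pairs` are hypothetical.

```r
# Made-up candidate pairs, one row per pair, IDs in columns `a` and `b`.
candidates <- data.frame(
  a = c("A1", "A1", "B1", "A2"),
  b = c("A2", "B7", "B2", "B3"),
  stringsAsFactors = FALSE
)

# Assumption: IDs of type A documents start with "A", all others are type B.
is_type_a <- function(id) startsWith(id, "A")

# Keep a pair only when exactly one of its two documents is of type A.
keep_cross_pairs <- function(pairs) {
  cross <- xor(is_type_a(pairs$a), is_type_a(pairs$b))
  pairs[cross, , drop = FALSE]
}

# Only A1/B7 and A2/B3 remain; A1/A2 and B1/B2 are dropped.
cross_candidates <- keep_cross_pairs(candidates)
```

This only avoids scoring the same-type pairs. The hashing step still processes every document, so it does not by itself answer whether a faster overall approach exists.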

Is there a way to do this faster? Or is there a more appropriate way to do semantic analysis?

Thank you.
