lsa - Document Similarity in R


I am new to text mining and semantic analysis. I am trying to find the similarity of documents. As a first approach, I am using jaccard_similarity to find which documents are similar. From what I have read, jaccard_similarity finds similar words, so it is perhaps not the best approach for finding similar documents, but I thought I would give it a try anyway.

The problem I am facing is the following. I have 200 type A documents and 1,000,000 type B documents. I need to see which (if any) of the A documents are similar to B documents. I do not need to see whether type A documents are similar to each other, or whether B documents are similar to each other.

I found the "textreuse" package in R to perform this. I do the following:

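library(textreuse)

# Minhash function producing 100 hashes per document
minhash <- minhash_generator(100, seed = 235)

# Tokenize each document into 5-word n-grams and minhash them
corpus <- TextReuseCorpus(text = as.character(data$text),
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)

# Locality-sensitive hashing: 50 bands of 2 hashes each
buckets <- lsh(corpus, bands = 50, progress = FALSE)

# Pairs of documents that share at least one bucket
candidates <- lsh_candidates(buckets)

# Score only the candidate pairs with Jaccard similarity
scores <- lsh_compare(candidates, corpus, jaccard_similarity,
                      progress = TRUE)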

where data$text is a subsample of the complete set of A+B documents. The issue is that there are too many documents and, with this approach, I am doing unnecessary comparisons (A1 to A2 documents, for example, or B1 to B2).
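To make clear which comparisons I would like to skip, here is a rough sketch in base R of the filtering I have in mind. It does not call textreuse. The data frame is a hypothetical stand-in for the candidate pairs (document ids in columns a and b), and the "A-"/"B-" id prefixes and the doc_type helper are my own assumptions for illustration, not something the package provides.

```r
# Hypothetical stand-in for a table of candidate pairs:
# one row per pair, with document ids in columns `a` and `b`.
candidates <- data.frame(
  a = c("A-1", "A-1", "A-2", "B-1"),
  b = c("A-2", "B-3", "B-1", "B-2"),
  score = NA_real_,
  stringsAsFactors = FALSE
)

# Assumption: each id carries its document type as a prefix ("A-" or "B-").
doc_type <- function(id) sub("-.*$", "", id)

# Keep only the pairs whose two documents are of different types,
# dropping A-to-A and B-to-B pairs before any scoring is done.
cross <- candidates[doc_type(candidates$a) != doc_type(candidates$b), ]
print(cross)
```

The idea would be to score only the rows of cross instead of every candidate pair. I have not checked whether lsh_compare accepts a data frame subsetted this way, and this would only reduce the scoring step, not the hashing of all documents.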

Is there a way to do this faster? Or is there a more appropriate way to do semantic analysis?

Thank you.

