parallel processing - How to do parallelization k-means in R?

March 15, 2011

I have a large dataset (5000*100) and want to use the kmeans function to find clusters. However, I do not know how to use the clusterApply function.

set.seed(88)
mydata=rnorm(5000*100)
mydata=matrix(data=mydata,nrow = 5000,ncol = 100)

parallel.a=function(i) {
  kmeans(mydata,3,nstart = i,iter.max = 1000)
}

library(parallel)
cl.cores <- detectCores()-1
cl <- makeCluster(cl.cores)
clusterSetRNGStream(cl,iseed=1234)
fit.km = clusterApply(cl,x,fun=parallel.a(500))
stopCluster(cl)

clusterApply requires an 'x' value, which I do not know how to set. Also, what is the difference between clusterApply, parSapply, and parLapply? Thanks a lot.

Here's a way to use clusterApply to perform a parallel kmeans by parallelizing over the nstart argument (assuming it is greater than one):

library(parallel)
nw <- detectCores()
cl <- makeCluster(nw)
clusterSetRNGStream(cl, iseed=1234)
set.seed(88)
mydata <- matrix(rnorm(5000 * 100), nrow=5000, ncol=100)

# Parallelize over the "nstart" argument
nstart <- 100
# Create a vector of length "nw" where sum(nstartv) >= nstart
nstartv <- rep(ceiling(nstart / nw), nw)
results <- clusterApply(cl, nstartv,
        function(n, x) kmeans(x, 3, nstart=n, iter.max=1000),
        mydata)

# Pick the best result (smallest total within-cluster sum of squares)
i <- sapply(results, function(result) result$tot.withinss)
result <- results[[which.min(i)]]
print(result$tot.withinss)

# Shut down the workers when finished
stopCluster(cl)

Although people typically export mydata to the workers, this example passes it as an additional argument to clusterApply. That makes sense (since the number of tasks is equal to the number of workers), is more efficient (since it combines the export with the computation), and avoids creating a global variable on the cluster workers (which is a bit more tidy). (Of course, exporting makes more sense if you plan to perform more computations on the workers with that data set.)
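For comparison, here is a minimal sketch of the export approach using clusterExport. The two-worker cluster, the smaller matrix, and the follow-up dim() call are my own assumptions, chosen only to keep the example quick:

```r
library(parallel)

# Assumption: two workers and a small matrix, just for illustration.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 1234)
set.seed(88)
mydata <- matrix(rnorm(500 * 10), nrow = 500, ncol = 10)

# Copy mydata into each worker's global environment once.
clusterExport(cl, "mydata")

# The worker function now looks up mydata as a global on the worker,
# so it no longer needs to be passed as an extra argument.
results <- clusterApply(cl, c(5, 5),
                        function(n) kmeans(mydata, 3, nstart = n, iter.max = 1000))

# A later computation can reuse the exported copy without resending it.
dims <- clusterEvalQ(cl, dim(mydata))

stopCluster(cl)
```

The exported copy stays on each worker until the cluster is stopped, which is what makes this worthwhile when several computations share the same data.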

Note that you can use detectCores()-1 workers if you like, but benchmarking on my machine shows that it performs faster with detectCores() workers. I suggest that you benchmark it on your machine to see what works better for you.
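One way to run that comparison is sketched below. The helpers run_kmeans and time_kmeans are hypothetical names of mine, and the smaller data set and nstart are assumptions to keep the run short. Timings will vary between machines and runs, so treat a single measurement with caution:

```r
library(parallel)

set.seed(88)
# Assumption: a smaller matrix than in the question, to keep the benchmark quick.
mydata <- matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)

# Hypothetical worker function, defined at top level so that only the
# function itself (not a surrounding environment) is sent to the workers.
run_kmeans <- function(n, x) kmeans(x, 3, nstart = n, iter.max = 1000)

# Hypothetical helper: elapsed seconds for a parallel kmeans with nw workers.
time_kmeans <- function(nw, x, nstart) {
  cl <- makeCluster(nw)
  on.exit(stopCluster(cl))
  clusterSetRNGStream(cl, iseed = 1234)
  nstartv <- rep(ceiling(nstart / nw), nw)
  system.time(clusterApply(cl, nstartv, run_kmeans, x))[["elapsed"]]
}

nc <- detectCores()
if (is.na(nc)) nc <- 2L   # detectCores() can return NA on some platforms
# Compare detectCores() - 1 against detectCores(), with at least one worker.
workers <- unique(pmax(1L, c(nc - 1L, nc)))
timings <- sapply(workers, time_kmeans, x = mydata, nstart = 20)
names(timings) <- workers
print(timings)
```

Note that the timed region covers only the clusterApply call, not the cost of starting the workers.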

As for the difference between the different parallel functions, clusterApply is a parallel version of lapply that processes each value of x in a separate task. parLapply is a parallel version of lapply that splits x such that it sends only one task per cluster worker (which can be more efficient). parSapply calls parLapply and then simplifies the result in the same way that sapply simplifies the result of calling lapply.
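A small sketch on a toy input may make the difference concrete. The two-worker cluster and the square function are illustrative assumptions of mine, not part of the question:

```r
library(parallel)

cl <- makeCluster(2)   # assumption: two workers, just for illustration

x <- 1:6
square <- function(v) v^2

# clusterApply: one task per element of x (six tasks here),
# assigned to the workers in turn.
a <- clusterApply(cl, x, square)

# parLapply: x is split into one chunk per worker (two tasks here),
# and the per-chunk results are combined back into a single list.
b <- parLapply(cl, x, square)

# parSapply: same work as parLapply, but the result is simplified
# the way sapply would simplify it (a numeric vector in this case).
s <- parSapply(cl, x, square)

stopCluster(cl)

str(a)
str(b)
str(s)
```

Both clusterApply and parLapply return a list with one element per value of x; only the number of tasks sent to the workers differs.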

clusterApply makes sense for parallel kmeans since you're manually splitting nstart such that it sends only one task per cluster worker, making parLapply unnecessary.
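The rep(ceiling(nstart / nw), nw) split used above rounds up, so the total number of starts can exceed nstart (for example, 8 workers and nstart = 100 gives 8 * 13 = 104 starts). If you want the total to match exactly, a possible variation is sketched below; split_nstart is a hypothetical helper name, not something from the parallel package:

```r
# Hypothetical helper: split nstart into at most nw counts that sum to nstart.
split_nstart <- function(nstart, nw) {
  base  <- nstart %/% nw          # starts every worker gets
  extra <- nstart %% nw           # leftover starts, one each for the first workers
  counts <- rep(base, nw) + (seq_len(nw) <= extra)
  counts[counts > 0]              # drop empty tasks when nw > nstart
}

nstartv <- split_nstart(100, 8)
print(nstartv)        # 13 13 13 13 12 12 12 12
print(sum(nstartv))   # 100
```

The resulting vector can be passed to clusterApply in place of nstartv from the example above.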
