c++ - cuSPARSE throughput goes down when the number of iterations increases


I am trying to calculate the maximum throughput possible on a GPU for sparse matrix * dense vector multiplication, utilizing as many compute resources as possible.

In order to accomplish this, I tried two methods:

  1. Allocate memory for A, x, and y on the host. Store A, x, and y on the host. Allocate memory for A, x, and y on the device. Store A, x, and y on the device. Start the timer. Perform the sparse matrix * dense vector multiplication via cusparseScsrmv in a for loop, running cusparseScsrmv num_iterations times. Stop the timer. Copy y from the device to the host and check the result for accuracy.

  2. Allocate memory for A, x, and y on the host. Store A, x, and y on the host. Allocate memory for an array of A, x, and y on the device (i.e. x[num_imps], A[num_imps]). Store A, x, and y on the device. Start the timer. Perform the sparse matrix * dense vector multiplication via cusparseScsrmv in a for loop, running cusparseScsrmv num_imps times, once for each A[i]*x[i]. Stop the timer. Copy y[num_imps-1] from the device to the host and check the result for accuracy.

Here is the code for method 1:

    // == Start timer to measure multiplication ==
    QueryPerformanceFrequency(&frequency1);
    QueryPerformanceCounter(&startingtime1);

    // Sparse matrix * dense vector multiplication
    /* exercise Level 2 routines (csrmv) */
    for (int i = 0; i < num_iterations; i++) {
        status = cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
                                &alpha, descr, cooval, csrrowptr, coocolindex,
                                &xval[0], &beta, &y[0]);
    }

    // == End timer to measure multiplication ==
    QueryPerformanceCounter(&endingtime1);
    elapsedmicroseconds1.QuadPart = endingtime1.QuadPart - startingtime1.QuadPart;
    elapsedmicroseconds1.QuadPart *= 1000000;
    elapsedmicroseconds1.QuadPart /= frequency1.QuadPart;

Here is the code for method 2:

    // == Start timer to measure multiplication ==
    QueryPerformanceFrequency(&frequency1);
    QueryPerformanceCounter(&startingtime1);

    for (int i = 0; i < num_imps; i++) {
        status = cusparseScsrmv(handle_array[i], CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, nnz,
                                &alpha, descr_array[i], cooval_array[i], csrrowptr_array[i], coocolindex_array[i],
                                &xval_array[i][0], &beta, &y_array[i][0]);
    }

    // == End timer to measure multiplication ==
    QueryPerformanceCounter(&endingtime1);
    elapsedmicroseconds1.QuadPart = endingtime1.QuadPart - startingtime1.QuadPart;
    elapsedmicroseconds1.QuadPart *= 1000000;
    elapsedmicroseconds1.QuadPart /= frequency1.QuadPart;

If num_iterations or num_imps = 1, I get the same throughput with both methods. If num_imps = 10, the throughput maxes out. Once num_imps = 100 or more, the throughput starts to decrease. The same happens as num_iterations increases: once I set num_iterations to a very large number, such as 100,000, the throughput drops below the throughput for num_iterations = 1.

Why is this happening? I would expect the throughput to saturate at some point and not be able to go any higher, but not to decrease.

My thoughts are that, due to the multiple calls to cusparseScsrmv, the GPU bogs down, or perhaps the GPU needs to cool down and so slows down, and therefore the throughput goes down. But these don't seem like reasonable conclusions to me.

Quoting from the documentation:

The cuSPARSE library functions are executed asynchronously with respect to the host and may return control to the application on the host before the result is ready. Developers can use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSPARSE library routine has completed.

There is nothing wrong here except the way you are doing the timing. At present you are measuring the time to enqueue the library calls, not the time to run them. It is reasonable to expect enqueue performance to drop once you have many tens or hundreds of operations queued up.
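
To make the difference between enqueue time and run time concrete, here is a minimal host-only sketch in standard C++. It does not use CUDA or cuSPARSE at all. AsyncQueue, Timing and time_batch are hypothetical names invented for this illustration, the worker thread stands in for the GPU, and the sleep stands in for one csrmv call. The timing of a real GPU queue will differ, so treat this as an analogy only.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for a GPU work queue: tasks are queued and run later
// by one worker thread, so enqueue() returns before the work is done.
class AsyncQueue {
public:
    AsyncQueue() : worker_([this] { run(); }) {}

    ~AsyncQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_worker_.notify_one();
        worker_.join();  // the worker drains any remaining tasks first
    }

    // Returns immediately, like an asynchronous library call.
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        wake_worker_.notify_one();
    }

    // Blocks until all queued work has finished
    // (the role cudaDeviceSynchronize() plays in the quoted documentation).
    void synchronize() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_worker_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // asked to stop and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_worker_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> tasks_;
    int pending_ = 0;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

struct Timing {
    std::int64_t enqueue_us;  // time until the loop of enqueue() calls returned
    std::int64_t total_us;    // time until all queued work actually finished
};

// Queues `iterations` simulated calls and reads the clock at both points.
Timing time_batch(int iterations, std::chrono::microseconds work_per_call) {
    using clock = std::chrono::steady_clock;
    assert(iterations >= 0);
    auto to_us = [](clock::duration d) {
        return static_cast<std::int64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(d).count());
    };

    AsyncQueue queue;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        queue.enqueue([work_per_call] { std::this_thread::sleep_for(work_per_call); });
    }
    const auto enqueued = clock::now();  // where the timer in the question stops
    queue.synchronize();
    const auto finished = clock::now();  // where the timer should stop
    return Timing{to_us(enqueued - start), to_us(finished - start)};
}
```

Calling time_batch(20, std::chrono::microseconds(2000)) should give a total_us of at least 40000, while enqueue_us is typically far smaller because it only covers the cost of queuing. In the code from the question, the corresponding change would be to call cudaDeviceSynchronize() after the for loop and before the second QueryPerformanceCounter call, so that the measured interval covers execution rather than just enqueueing.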

