c++ - CUDA Race Check Hazard after __syncthreads()
I am trying to process a matrix in parallel in CUDA. I need to compute each column of the matrix against a given vector; if the result is greater than a certain value the column is kept, otherwise the column is removed from further computation. To avoid copying and restructuring the matrix, I use column indices to indicate whether a column should be used in further computation.
This process needs to be done multiple times, and each time a different subset of columns needs to be checked, so I created a matrix to store the column indices to process in each round. For example, if I have a matrix of 10 columns and need to repeat the process 4 times, the column_indices matrix may look like this:
thrust::device_vector<int> column_indices( std::vector<int>( {
    0, 1, -1, -1, -1, // 2 columns contain useful information
    5, 6, 7, -1, -1,  // 3 columns contain useful information
    9, 8, 7, 6, -1,   // 4 columns contain useful information
    4, 3, 2, 1, 0     // 5 columns contain useful information
} ) );
This is a simplified example; in my real code I have to process matrices with 500-1000 columns. Because not all columns need to be processed each time and the number of columns is big, it may not be a good idea to assign each column to a thread, since that means maybe half of the threads would sit idle, as in the flat-launch sketch below.
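To show what I mean, here is a rough sketch of the flat approach I wanted to avoid (child_flat is hypothetical and not part of my actual code): every row gets column_size threads regardless of how many columns are active, and the extra threads exit immediately.

// Hypothetical flat launch: one block per row with column_size threads each.
// Threads whose index is beyond the row's active column count return right
// away, so a row with few active columns leaves most of its block idle.
__global__ void child_flat( float const* input_a, const int nrow, float const* input_b,
                            const int* column_indices, const int* column_counts,
                            const int column_size )
{
    int row = blockIdx.x;
    if ( threadIdx.x >= column_counts[ row ] ) return; // idle thread

    int thread_column = column_indices[ row * column_size + threadIdx.x ];
    float const* thread_input = input_a + thread_column * nrow;
    // ... same per-column calculation as in the child kernel shown below ...
}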
So I decided to use dynamic parallelism instead: a parent kernel checks how many threads are needed for each row, launches a child kernel with exactly that number of threads, and allocates exactly the shared memory needed.
Here is my code:
#include <iostream>
#include <iterator>
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

__device__ float calculate( const float* v1, const float* v2, const int length )
{
    // mock calculation: returns 0.0 for even threads and 0.5 for odd threads
    return threadIdx.x % 2 == 0 ? 0.0f : 0.5f;
}

__global__ void child( float const* input_a, const int nrow, float const* input_b, int* columns, int* counts )
{
    extern __shared__ float results[];

    // input_a is a matrix stored in column-major order, and input_b is a vector
    int thread_column = columns[ threadIdx.x ];
    float const* thread_input = input_a + thread_column * nrow;
    results[ threadIdx.x ] = calculate( thread_input, input_b, nrow );
    //--------------discussion-----------
    // The race condition is gone if I replace the line above with this:
    //     atomicExch( results + threadIdx.x, calculate( thread_input, input_b, nrow ) );
    // However, that looks unnecessary to me since each thread is accessing a different address.
    //-----------------------------------
    __syncthreads();

    if ( threadIdx.x == 0 )
    {
        // sort the column indices in descending order of their results,
        // so the indices to be removed end up at the end of the indices
        thrust::sort_by_key( thrust::seq, results, results + blockDim.x, columns, thrust::greater<float>() );

        // count the number of indices to be removed
        int remove_count = thrust::count( thrust::seq, results, results + blockDim.x, 0.0f );
        *counts -= remove_count;
    }
}

__global__ void parent( float const* inputs, const int nrow, float const* output, int* column_indices, int* column_counts, const int column_size )
{
    int row_per_group = blockDim.x;
    int group_num = blockIdx.x, row_num = threadIdx.x;
    int tid = group_num * row_per_group + row_num;

    int* indices_for_this_block = column_indices + tid * column_size;
    int* count_for_this_block = column_counts + tid;

    // launch a child kernel to process this row
    int block_size = *count_for_this_block;
    if ( block_size > 0 )
    {
        child<<< 1, block_size, sizeof( float ) * block_size >>>(
            inputs, nrow, output, indices_for_this_block, count_for_this_block
        );
        cudaDeviceSynchronize();
    }
}

int main()
{
    thrust::device_vector<int> column_indices( std::vector<int>( {
        0, 1, -1, -1, -1, // 2 columns contain useful information
        5, 6, 7, -1, -1,  // 3 columns contain useful information
        9, 8, 7, 6, -1,   // 4 columns contain useful information
        4, 3, 2, 1, 0     // 5 columns contain useful information
    } ) );
    thrust::device_vector<int> column_count( std::vector<int>( { 2, 3, 4, 5 } ) );

    // Process column_indices in 2 groups, each group processing 2 rows.
    // Because we are mocking the correlation results, we don't need real data,
    // so we pass nullptr as the data pointers.
    parent<<< 2, 2 >>>(
        nullptr, 0, nullptr, column_indices.data().get(), column_count.data().get(), 5
    );
    //--------------discussion-----------
    // The race condition is also gone if I launch the parent kernel like this:
    //     parent<<< 2, 2, sizeof( float ) * 5 >>>(
    //         nullptr, 0, nullptr, column_indices.data().get(), column_count.data().get(), 5
    //     );
    // But when the total number of columns is big, this approach fails because it
    // exceeds the maximum capacity of shared memory
    // (although only a fraction of the allocation is actually used).
    //-----------------------------------
    cudaDeviceSynchronize();

    std::cout << "row #0: ";
    std::copy( column_indices.begin(), column_indices.begin() + column_count[ 0 ], std::ostream_iterator<int>( std::cout, ", " ) );
    std::cout << std::endl;
    std::cout << "row #1: ";
    std::copy( column_indices.begin() + 5, column_indices.begin() + 5 + column_count[ 1 ], std::ostream_iterator<int>( std::cout, ", " ) );
    std::cout << std::endl;
    std::cout << "row #2: ";
    std::copy( column_indices.begin() + 10, column_indices.begin() + 10 + column_count[ 2 ], std::ostream_iterator<int>( std::cout, ", " ) );
    std::cout << std::endl;
    std::cout << "row #3: ";
    std::copy( column_indices.begin() + 15, column_indices.begin() + 15 + column_count[ 3 ], std::ostream_iterator<int>( std::cout, ", " ) );
    std::cout << std::endl;
}
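For reference, because this uses dynamic parallelism, I build it as relocatable device code for compute capability 3.5 or higher, link against the device runtime, and run the race checker on the binary roughly like this (the output name is arbitrary):

nvcc -arch=sm_35 -rdc=true main.cu -o main -lcudadevrt
cuda-memcheck --tool racecheck ./main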
Running the above code, I get the correct results:
row #0: 1,
row #1: 6,
row #2: 8, 6,
row #3: 3, 1,
However, cuda-memcheck seems to complain about potential race conditions like this:
========= WARN:(Warp Level Programming) Potential RAW hazard detected at __shared__ 0x13 in block (0, 0, 0) :
=========     Write Thread (4, 0, 0) at 0x00000070 in /path_to_file/main.cu:23:child(float const *, int, float const *, int*, int*)
=========     Read Thread (0, 0, 0) at 0x00000648 in /usr/local/cuda/include/thrust/system/detail/sequential/insertion_sort.h:109:child(float const *, int, float const *, int*, int*)
=========     Current Value : 0
Line #23 in main.cu is this line:
results[ threadIdx.x ] = calculate( thread_input, input_b, nrow );
And the reading thread seems to be related to:
thrust::sort_by_key( thrust::seq, results, results + blockDim.x, columns, thrust::greater<float>() );
But why would this happen between two lines that are separated by __syncthreads()? I don't understand why this is happening:
- With my example, each child block has at most 5 threads.
- I called __syncthreads() before letting thread 0 process the calculated results.
- My understanding is that shared memory is private to each block (maybe the problem comes from here?), so multiple launches of the child kernel should not interfere with each other.
- If I modify the code as outlined in the discussion sections in the code, the race condition goes away. Why do those versions work while this one doesn't?
Could someone please let me know what I did wrong? Thank you very much!
At this time (through CUDA 8.0), the cuda-memcheck racecheck tool does not support dynamic parallelism.
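So the report above is an artifact of the tool rather than evidence of a real hazard; the __syncthreads() barrier correctly orders the shared-memory writes before thread 0's reads. If you still want racecheck coverage of the child kernel's logic, one option (a sketch using the question's mock data) is to launch child directly from the host for a single row, taking dynamic parallelism out of the picture:

// Sketch: drive the child kernel from the host for one row so racecheck,
// which does not follow device-side launches, can analyze its accesses.
int block_size = column_count[ 3 ];  // row #3 has 5 active columns
child<<< 1, block_size, sizeof( float ) * block_size >>>(
    nullptr, 0, nullptr,
    column_indices.data().get() + 3 * 5,  // indices for row #3
    column_count.data().get() + 3
);
cudaDeviceSynchronize();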