OpenCL - The difference between Buffer and global memory


In OpenCL, buffers are the conduit through which data is communicated from the host application to the kernels.

    cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size,
                          void *host_ptr, cl_int *errcode_ret);

Now, if I have a buffer a_buffer flagged CL_MEM_READ_ONLY, and the kernel is:

    __kernel void two_buffer_double(__global float* a) {
        int i = get_global_id(0);
        float b = a[i] * 2;
    }

My question is: is a_buffer in global memory or in constant memory? Should I use the __constant qualifier for a? What is the connection between cl_mem_flags (CL_MEM_READ_ONLY, CL_MEM_READ_WRITE) and the memory qualifiers (__global, __constant)?

The __constant qualifier is used for constant memory, which on some cards lives in the texture cache and has bandwidth independent of __global memory, but it is limited in size.

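A read-only global pointer, written __global const float * (the __read_only access qualifier itself applies only to image and pipe arguments), means the OpenCL implementation may try to put the data in a cache (or use another data path) if the hardware sees fit. __global is limited only by the VRAM size or a fraction of it, instead of the 64 KB (for example) of __constant.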

These qualifiers are for device-side optimization.

For host-side optimization, you should supply CL_MEM_READ_ONLY as the flag at buffer creation. It means the device will only read the buffer (probably using DMA/PCIe access or caching optimizations), but it can still be written from the host side (the host being the C# or C++ code, not the device) using clEnqueueWriteBuffer or map/unmap operations.

__constant is meant for parametric constant definitions, not for the data being processed.

If you are writing filter code, the data should be __global and the filter mask should be __constant if it cannot fit in __private memory (which has the ultimate bandwidth) or __local memory (slower than private), so that accessing the mask bytes does not decrease the data bandwidth.
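The filter case above might look roughly like the following sketch. The kernel name filter_1d, the fixed MASK_SIZE, and the clamp-to-edge border handling are assumptions made for illustration, not something from the question. The block under #ifndef __OPENCL_VERSION__ is a hypothetical host-side shim (with a made-up emulated_global_id variable) that defines the address-space qualifiers away so the same source can be compiled as plain C and checked on the CPU; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (an assumption for illustration only): lets this
   OpenCL C kernel compile as plain C for a quick CPU check. */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
#define __constant const
size_t emulated_global_id = 0; /* hypothetical stand-in for the work-item id */
size_t get_global_id(unsigned dim) { (void)dim; return emulated_global_id; }
#endif

#define MASK_SIZE 3 /* assumed small enough for the constant buffer limit */

/* Bulk data travels through __global pointers; the small read-only
   mask is a __constant argument, so mask reads do not compete with
   the data stream. */
__kernel void filter_1d(__global const float *in,
                        __global float *out,
                        __constant float *mask,
                        int n)
{
    int i = (int)get_global_id(0);
    if (i >= n) {
        return; /* guard for global sizes rounded up past n */
    }

    float acc = 0.0f;
    for (int k = 0; k < MASK_SIZE; ++k) {
        int j = i + k - MASK_SIZE / 2;
        /* Clamp to the edges (assumed border policy). */
        if (j < 0) j = 0;
        if (j > n - 1) j = n - 1;
        acc += in[j] * mask[k];
    }
    out[i] = acc;
}
```

On the host, all three arguments would still be ordinary cl_mem buffers from clCreateBuffer; only the kernel signature decides which address space each one is read through.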

Now, answering your question:

"Is a_buffer in global memory or constant memory?"

It is global on the device side (kernel side) because it is declared __global, but it can be anywhere on the host side (hardware).

Edit: For the host side, it depends on the other flags used. For example, CL_MEM_USE_HOST_PTR makes the buffer directly accessible in system RAM, and there is only a virtual buffer on the device side. Without it, CL_MEM_READ_WRITE device memory would have a real buffer and a mapped shadow in RAM (as a sub-step of clEnqueueReadBuffer or clEnqueueWriteBuffer), and copies visit the shadow first and are then uploaded to the GPU.

An example device: Intel(R) HD (TM) Graphics 400 in a 4 GB DDR3L laptop:

    Query                                 Value
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE    65536 bytes
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE       262144 bytes
    CL_DEVICE_GLOBAL_MEM_SIZE             1636414260 bytes
    CL_DEVICE_GLOBAL_MEM_CACHE_TYPE       CL_READ_WRITE_CACHE
    CL_DEVICE_LOCAL_MEM_SIZE              65536 bytes (vs. constant: benchmark it)
    CL_DEVICE_LOCAL_MEM_TYPE              CL_LOCAL (so faster than global)

You cannot query the private memory size. For a mid-segment gaming AMD card, it is 256 KB per thread group. If you set 64 threads per group, you can use 4 KB of register space per thread, or half of that (because of compiler optimizations), before it gets slow because of spilling to global memory.
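The per-thread figure is just the group's register space divided by the group size. A minimal sketch of that arithmetic follows; the function name private_bytes_per_thread is hypothetical, and the 256 KB input is the figure quoted above, not a value OpenCL can report.

```c
#include <assert.h>
#include <stddef.h>

/* Share of a thread group's private (register) space available to each
   thread, in bytes. group_bytes has to come from vendor documentation
   or measurement, since there is no OpenCL query for it. */
size_t private_bytes_per_thread(size_t group_bytes, size_t threads_per_group)
{
    assert(threads_per_group > 0);
    return group_bytes / threads_per_group;
}
```

With 256 * 1024 bytes and 64 threads this gives 4096 bytes per thread; halving it gives the more cautious 2048-byte budget mentioned above.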

