constants - OpenCL - The difference between Buffer and global memory
In OpenCL, buffers are the conduit through which data is communicated between the host application and the device.
cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret);
Now, if I have a buffer a_buffer flagged CL_MEM_READ_ONLY, and the kernel is:
__kernel void two_buffer_double(__global float* a) { int i = get_global_id(0); float b = a[i] * 2; }
My question is: is a_buffer in global memory or constant memory? Should I use the __constant qualifier for a? What is the connection between the cl_mem_flags (CL_MEM_READ_ONLY, CL_MEM_READ_WRITE) and the memory qualifiers (__global, __constant)?
The __constant qualifier is used for constant memory. On some cards this lives in the texture cache, and it has bandwidth independent of __global, but it is limited in size.
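A read-only __global pointer such as __global const float* means that the OpenCL implementation will try to put the data in a cache (or use some other data path) if the hardware sees fit. __global is limited by the VRAM size, or some fraction of it, instead of the 64 kB (for example) available to __constant.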
These qualifiers are for device-side optimization.
For host-side optimization, you should supply CL_MEM_READ_ONLY as a flag at buffer creation. This means the device will only read from the buffer (probably using some DMA/PCIe access or caching optimizations), but it can still be written from the host side (the host being the C# or C++ code, not the device) using clEnqueueWriteBuffer or map/unmap operations.
__constant is for parametric constant definitions, not for data processing.
If you are writing filter code, the data should be __global and the filter mask should be __constant if it cannot fit in __private memory (which has the highest bandwidth) or __local memory (slower than private), so that accessing the mask bytes does not decrease the data bandwidth.
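The filter case above can be sketched as follows. This is only an illustration under assumptions: the kernel name filter_1d, its parameter names, and the 1D layout are hypothetical and not taken from the question. The string holds OpenCL C source of the kind a host program would pass to clCreateProgramWithSource, with the data behind __global pointers and the mask behind a __constant pointer; the plain C function computes the same result on the CPU so the arithmetic can be checked without a device.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source for a hypothetical 1D filter kernel (names are
 * illustrative). Data is read through a __global pointer, the small mask
 * through a __constant pointer. The kernel expects a global work size of
 * n - mask_len + 1, where n is the number of input elements. */
const char *const filter_kernel_src =
    "__kernel void filter_1d(__global const float *data,\n"
    "                        __constant float *mask,\n"
    "                        const int mask_len,\n"
    "                        __global float *out)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    float acc = 0.0f;\n"
    "    for (int k = 0; k < mask_len; ++k)\n"
    "        acc += data[i + k] * mask[k];\n"
    "    out[i] = acc;\n"
    "}\n";

/* Plain C reference of the same computation: each iteration of the outer
 * loop does the work of one work-item. `out` must have room for
 * n - mask_len + 1 floats. */
void filter_1d_reference(const float *data, size_t n,
                         const float *mask, size_t mask_len, float *out)
{
    assert(mask_len > 0 && mask_len <= n);
    for (size_t i = 0; i + mask_len <= n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < mask_len; ++k)
            acc += data[i + k] * mask[k];
        out[i] = acc;
    }
}
```

Whether the __constant path is actually faster than reading the mask from __global or __local depends on the device, so it is worth benchmarking both.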
Now, answering your question:
"Is a_buffer in global memory or constant memory?"
It is global on the device side (kernel side) because you declared it __global, and it could be anywhere on the host side (hardware).
Edit: on the host side, it depends on the other flags you used. For example, CL_MEM_USE_HOST_PTR makes the buffer directly accessible from system RAM, and any buffer on the device side is only virtual. Without it, a CL_MEM_READ_WRITE buffer in device memory is a real buffer that is mapped to a shadow copy in RAM (as a sub-step of clEnqueueReadBuffer or clEnqueueWriteBuffer), and copied data visits that shadow first before it is uploaded to the GPU.
An example device: Intel(R) HD (TM) Graphics 400 in a 4 GB DDR3L laptop:
Query                                  Value
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE     65536 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE        262144 bytes
CL_DEVICE_GLOBAL_MEM_SIZE              1636414260 bytes
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE        CL_READ_WRITE_CACHE
CL_DEVICE_LOCAL_MEM_SIZE               65536 (vs. constant: benchmark it)
CL_DEVICE_LOCAL_MEM_TYPE               CL_LOCAL (so faster than global)
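As a rough illustration of what the constant-buffer limit means in practice, the sketch below checks whether a table of a given size would fit under the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. The helper name fits_in_constant_buffer is hypothetical, and the check ignores any other __constant arguments the kernel may already use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `count` elements of `elem_size` bytes fit within
 * `max_constant_bytes`, the value reported for
 * CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. Dividing instead of multiplying
 * avoids overflow in count * elem_size. */
bool fits_in_constant_buffer(size_t count, size_t elem_size,
                             size_t max_constant_bytes)
{
    assert(elem_size > 0);
    return count <= max_constant_bytes / elem_size;
}
```

With the 65536-byte limit above and 4-byte floats, fits_in_constant_buffer(16384, 4, 65536) holds while 16385 elements would not fit, so anything larger has to go through __global.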
You cannot query the private memory size, but for a mid-segment gaming AMD card it is 256 kB per thread group. If you set 64 threads per group, each thread can use 4 kB of register space, or half of that (because of compiler optimizations), before the kernel slows down because of spilling to global memory.
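The per-thread figure is just the group budget divided by the group size. A minimal sketch of that arithmetic, assuming the 256 kB per-group figure quoted above (the function name is hypothetical, and real limits vary by device and compiler):

```c
#include <assert.h>
#include <stddef.h>

/* Register (private) bytes available to each thread when a thread group
 * shares `group_register_bytes` evenly across `threads_per_group`. */
size_t private_bytes_per_thread(size_t group_register_bytes,
                                size_t threads_per_group)
{
    assert(threads_per_group > 0);
    return group_register_bytes / threads_per_group;
}
```

For example, private_bytes_per_thread(256 * 1024, 64) gives 4096 bytes, while raising the group size to 256 threads leaves 1024 bytes per thread, so larger groups reach the spilling point sooner.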