Parallel Programming 5
GPU Architecture & CUDA Programming
Add ALUs to increase compute capability
Same instruction broadcast to all ALUs
The operation is executed in parallel on all ALUs, each operating on different data
Implicit SIMD
- Compiler generates a binary with scalar instructions
- But N instances of the program are always run together on the processor
- Hardware is responsible for simultaneously executing the same instruction from multiple program instances on different data on SIMD ALUs
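For instance, a CUDA kernel is written as scalar code for a single thread, and the hardware runs many instances of it together in SIMD fashion (the kernel and names below are illustrative):

__global__ void scale(float* x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // each program instance computes its own index
    if (i < n)
        x[i] = alpha * x[i]; // same scalar instruction, run per instance on different data
}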
CUDA
Data-level parallelism
CUDA programs consist of a hierarchy of concurrent threads
(e.g., a 3x2 grid of thread blocks with 4x3 threads per block; blocks are scheduled by the HW)
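A minimal launch sketch for that hierarchy (myKernel and devPtr are placeholder names):

dim3 grid(3, 2);                    // 3x2 grid of thread blocks
dim3 block(4, 3);                   // 4x3 threads per block
myKernel<<<grid, block>>>(devPtr);  // the HW schedules the 6 blocks onto cores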
CUDA device memory model
cudaMalloc(&deviceA, length); // allocate in the device address space
cudaMemcpy(deviceA, data, length, cudaMemcpyHostToDevice); // copy host data to the device
- Shared memory
  - Readable/writable by all threads in a block
- Per-thread private memory
  - Readable/writable only by the owning thread
(A kernel sketch using both appears below.)
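A minimal sketch, assuming a block size of 128 threads (blockSum, in, out are illustrative names):

__global__ void blockSum(const float* in, float* out) {
    __shared__ float tile[128];                           // shared: visible to every thread in the block
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];  // per-thread private (lives in a register)
    tile[threadIdx.x] = v;
    __syncthreads();                                      // make shared-memory writes visible block-wide
    if (threadIdx.x == 0) {
        float sum = 0.0f;                                 // private to thread 0
        for (int i = 0; i < blockDim.x; ++i)
            sum += tile[i];
        out[blockIdx.x] = sum;
    }
}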
GPU implementation maps thread blocks to cores using a dynamic scheduling policy that respects resource requirements
1 warp = 32 threads
Threads in a warp are executed in a SIMD manner
These 32 logical CUDA threads share an instruction stream, so performance can suffer under divergent execution (see the sketch below)
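For example (illustrative kernel), a data-dependent branch splits a warp into two serialized paths:

__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)              // even and odd lanes of the same warp diverge
        x[i] = x[i] * 2.0f;      // runs with the odd lanes masked off
    else
        x[i] = x[i] + 1.0f;      // runs with the even lanes masked off
}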
An SM (Streaming Multiprocessor) core is capable of concurrently executing multiple CUDA thread blocks
Pinned memory
Avoids an extra copy step during transfers, so they are faster
cudaHostAlloc()
cudaFreeHost()
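A minimal sketch (hostBuf is an illustrative name; deviceA and length carry over from above):

float* hostBuf;
cudaHostAlloc((void**)&hostBuf, length, cudaHostAllocDefault); // page-locked host allocation
// ... fill hostBuf ...
cudaMemcpy(deviceA, hostBuf, length, cudaMemcpyHostToDevice);  // DMA directly from pinned memory
cudaFreeHost(hostBuf);                                         // release with cudaFreeHost, not free()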
CUDA streams
- CUDA supports parallel execution of kernels and cudaMemcpy with "streams"
- Operations (tasks) issued to different streams can run in parallel (see the sketch below)
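A minimal sketch reusing the illustrative names from above (myKernel, deviceA, deviceB, hostBuf, grid, block); the async copy needs pinned host memory to actually overlap:

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaMemcpyAsync(deviceA, hostBuf, length, cudaMemcpyHostToDevice, s1); // issued to stream s1
myKernel<<<grid, block, 0, s2>>>(deviceB);                             // stream s2: may overlap with the copy
cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);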