Parallel Programming 5

GPU Architecture & CUDA Programming

Add ALUs to increase compute capability

The same instruction is broadcast to all ALUs, so the operation is executed in parallel on all of them

Implicit SIMD

  • Compiler generates a binary with scalar instructions
  • But N instances of the program are always run together on the processor
  • Hardware is responsible for simultaneously executing the same instruction from multiple program instances, on different data, on the SIMD ALUs (see the sketch below)
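
A minimal sketch of what this means in CUDA: the kernel below is written as a scalar program for a single element, yet the hardware runs many instances of it together in SIMD fashion (the kernel name scale and its parameters are illustrative, not from the original):

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this instance's data index
    if (i < n)
        x[i] = a * x[i];  // one scalar multiply per program instance
}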

CUDA

Data-level parallelism

A CUDA program consists of a hierarchy of concurrent threads

[Figure: 3x2 grid of blocks, 4x3 threads per block; blocks are scheduled by the hardware]
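
A launch configuration matching the figure might look like the following sketch (the kernel name kernel2d and the data argument are assumptions for illustration):

__global__ void kernel2d(float* data, int width) {
    // Global 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    data[y * width + x] += 1.0f;
}

dim3 grid(3, 2);   // 3x2 grid of blocks
dim3 block(4, 3);  // 4x3 threads per block
kernel2d<<<grid, block>>>(data, 12);  // width = gridDim.x * blockDim.x = 3 * 4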


CUDA device memory model

[Figure: CUDA device memory model]

cudaMalloc(&deviceA, length); // allocate in the device address space

cudaMemcpy(deviceA, data, length, cudaMemcpyHostToDevice); // copy host data to the device
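
Put together, a typical host-side round trip looks like this sketch (the buffer size N and the pageable host buffer are assumptions; error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const int N = 1 << 20;
    const size_t length = N * sizeof(float);
    float* data = (float*)malloc(length);  // pageable host buffer

    float* deviceA;
    cudaMalloc(&deviceA, length);                               // allocate in the device address space
    cudaMemcpy(deviceA, data, length, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write deviceA ...
    cudaMemcpy(data, deviceA, length, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(deviceA);
    free(data);
    return 0;
}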

  • Shared memory
    • Readable/writable by all threads in a block (see the sketch after this list)
  • Per-thread private memory
    • Readable/writable only by the owning thread
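
A sketch showing both memory classes in one kernel: each thread's value starts in per-thread private storage (a register), and a block-level sum is computed in shared memory (assumes a launch with 256 threads per block; blockSum is an illustrative name):

__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];  // shared memory: visible to all threads in this block
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];  // per-thread private value
    buf[threadIdx.x] = v;
    __syncthreads();            // wait until every thread has written its slot

    // Tree reduction within the block, entirely in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];  // one partial sum per block
}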

The GPU implementation maps thread blocks to cores using a dynamic scheduling policy that respects resource requirements

1 warp = 32 threads

Threads in a warp are executed in a SIMD manner

These 32 logical CUDA threads share an instruction stream, so performance can suffer from divergent execution (see the sketch below)
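
A sketch of divergence (hypothetical kernel): even and odd threads of the same warp take different branch paths, so the warp executes both paths one after the other with the inactive lanes masked off:

__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // lanes of one warp split across the two paths,
        x[i] = x[i] * 2.0f;  // so this path runs with odd lanes masked off...
    else
        x[i] = x[i] + 1.0f;  // ...and then this path with even lanes masked off
}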

An SM (Streaming Multiprocessor) core is capable of concurrently executing multiple CUDA thread blocks


Pinned memory

[Figure: data transfer with pinned host memory]

Reduces the number of copy steps, so transfers are faster

cudaHostAlloc() allocates pinned (page-locked) host memory; cudaFreeHost() frees it
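
A minimal usage sketch, reusing deviceA and length from above (error checking omitted):

float* hostBuf;
cudaHostAlloc((void**)&hostBuf, length, cudaHostAllocDefault);  // pinned (page-locked) host memory
cudaMemcpy(deviceA, hostBuf, length, cudaMemcpyHostToDevice);   // DMA can read pinned pages directly
cudaFreeHost(hostBuf);  // pinned memory must be freed with cudaFreeHost, not free()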

CUDA streams

  • CUDA supports concurrent execution of kernels and memory copies (cudaMemcpyAsync) via “streams”
  • Operations (tasks) in different streams can run in parallel (see the sketch below)
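
A two-stream sketch: the copy in one stream can overlap the kernel in the other (devA/devB, the pinned buffers pinA/pinB, blocks, and the scale kernel from the earlier sketch are assumptions; async copies need pinned host memory to actually overlap):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Operations issued to different streams may execute concurrently
cudaMemcpyAsync(devA, pinA, length, cudaMemcpyHostToDevice, s0);
scale<<<blocks, 256, 0, s0>>>(devA, 2.0f, N);
cudaMemcpyAsync(devB, pinB, length, cudaMemcpyHostToDevice, s1);
scale<<<blocks, 256, 0, s1>>>(devB, 2.0f, N);

cudaStreamSynchronize(s0);  // wait for stream 0's work to finish
cudaStreamSynchronize(s1);  // wait for stream 1's work to finish
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);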

[Figure: operations in different streams overlapping in time]