Parallel Programming 5

GPU Architecture & CUDA Programming

Add ALUs to increase compute capability

The same instruction is broadcast to all ALUs, so the operation is executed in parallel on all of them

Implicit SIMD

  • Compiler generates a binary with scalar instructions
  • But N instances of the program are always run together on the processor
  • Hardware is responsible for simultaneously executing the same instruction from multiple program instances, on different data, on the SIMD ALUs (see the sketch below)
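
A minimal sketch of what this means in CUDA: the kernel below is written as a scalar program for a single element, yet the hardware runs many instances of it together in SIMD fashion (the kernel name scale and its parameters are illustrative, not from the original):

__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this instance's data index
    if (i < n)
        x[i] = a * x[i];  // one scalar multiply per program instance
}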

CUDA

Data-level parallelism

A CUDA program consists of a hierarchy of concurrent threads

[Figure: 3x2 grid of blocks, 4x3 threads per block; blocks are scheduled by the hardware]
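
A launch configuration matching the figure might look like the following sketch (the kernel name kernel2d and the data argument are assumptions for illustration):

__global__ void kernel2d(float* data, int width) {
    // Global 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    data[y * width + x] += 1.0f;
}

dim3 grid(3, 2);   // 3x2 grid of blocks
dim3 block(4, 3);  // 4x3 threads per block
kernel2d<<<grid, block>>>(data, 12);  // width = gridDim.x * blockDim.x = 3 * 4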


CUDA device memory model

[Figure: CUDA device memory model]

cudaMalloc(&deviceA, length); // allocate in the device address space

cudaMemcpy(deviceA, data, length, cudaMemcpyHostToDevice); // copy host data to the device
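
Put together, a typical host-side round trip looks like this sketch (the buffer size N and the pageable host buffer are assumptions; error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    const int N = 1 << 20;
    const size_t length = N * sizeof(float);
    float* data = (float*)malloc(length);  // pageable host buffer

    float* deviceA;
    cudaMalloc(&deviceA, length);                               // allocate in the device address space
    cudaMemcpy(deviceA, data, length, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that read/write deviceA ...
    cudaMemcpy(data, deviceA, length, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(deviceA);
    free(data);
    return 0;
}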

  • Shared memory
    • Readable/writable by all threads in a block (see the sketch after this list)
  • Per-thread private memory
    • Readable/writable only by the owning thread
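
A sketch showing both memory classes in one kernel: each thread's value starts in per-thread private storage (a register), and a block-level sum is computed in shared memory (assumes a launch with 256 threads per block; blockSum is an illustrative name):

__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];  // shared memory: visible to all threads in this block
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];  // per-thread private value
    buf[threadIdx.x] = v;
    __syncthreads();            // wait until every thread has written its slot

    // Tree reduction within the block, entirely in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];  // one partial sum per block
}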

The GPU implementation maps thread blocks to cores using a dynamic scheduling policy that respects resource requirements

1 warp = 32 threads

Threads in a warp are executed in a SIMD manner

These 32 logical CUDA threads share an instruction stream, so performance can suffer from divergent execution (see the sketch below)
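
A sketch of divergence (hypothetical kernel): even and odd threads of the same warp take different branch paths, so the warp executes both paths one after the other with the inactive lanes masked off:

__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)          // lanes of one warp split across the two paths,
        x[i] = x[i] * 2.0f;  // so this path runs with odd lanes masked off...
    else
        x[i] = x[i] + 1.0f;  // ...and then this path with even lanes masked off
}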

An SM (Streaming Multiprocessor) core is capable of concurrently executing multiple CUDA thread blocks


Pinned memory

[Figure: data transfer with pinned host memory]

Reduces the number of copy steps, so transfers are faster

cudaHostAlloc() allocates pinned (page-locked) host memory; cudaFreeHost() frees it
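
A minimal usage sketch, reusing deviceA and length from above (error checking omitted):

float* hostBuf;
cudaHostAlloc((void**)&hostBuf, length, cudaHostAllocDefault);  // pinned (page-locked) host memory
cudaMemcpy(deviceA, hostBuf, length, cudaMemcpyHostToDevice);   // DMA can read pinned pages directly
cudaFreeHost(hostBuf);  // pinned memory must be freed with cudaFreeHost, not free()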

CUDA streams

  • CUDA supports concurrent execution of kernels and memory copies (cudaMemcpyAsync) via “streams”
  • Operations (tasks) in different streams can run in parallel (see the sketch below)
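
A two-stream sketch: the copy in one stream can overlap the kernel in the other (devA/devB, the pinned buffers pinA/pinB, blocks, and the scale kernel from the earlier sketch are assumptions; async copies need pinned host memory to actually overlap):

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Operations issued to different streams may execute concurrently
cudaMemcpyAsync(devA, pinA, length, cudaMemcpyHostToDevice, s0);
scale<<<blocks, 256, 0, s0>>>(devA, 2.0f, N);
cudaMemcpyAsync(devB, pinB, length, cudaMemcpyHostToDevice, s1);
scale<<<blocks, 256, 0, s1>>>(devB, 2.0f, N);

cudaStreamSynchronize(s0);  // wait for stream 0's work to finish
cudaStreamSynchronize(s1);  // wait for stream 1's work to finish
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);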

[Figure: operations in different streams overlapping in time]