Parallel Programming 4

Performance Optimization : Locality, Communication, and Contention

Total communication time : overhead + occupancy + network delay

With pipelined communication, execution can become memory bandwidth-bound!
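The cost model above can be sketched in a few lines. This is a simplified illustration, not a definitive model: the function names are hypothetical, and the pipelined estimate assumes that messages after the first are spaced by the slower of overhead and occupancy while network delay overlaps across messages.

```python
def total_communication_time(overhead, occupancy, network_delay):
    """Time for one message: overhead + occupancy + network delay."""
    return overhead + occupancy + network_delay


def pipelined_time(num_messages, overhead, occupancy, network_delay):
    """Assumed pipeline model: the first message pays the full latency,
    and each later message is spaced by the slowest stage."""
    if num_messages == 0:
        return 0
    first = total_communication_time(overhead, occupancy, network_delay)
    bottleneck = max(overhead, occupancy)
    return first + (num_messages - 1) * bottleneck


# Example with made-up time units: 4 messages, one at a time vs. pipelined.
serial = 4 * total_communication_time(2, 5, 10)   # 68
pipelined = pipelined_time(4, 2, 5, 10)           # 17 + 3 * 5 = 32
```

Once the pipeline is full, throughput depends only on the bottleneck stage, which is why a steady stream of memory requests ends up limited by memory bandwidth rather than by latency.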

Arithmetic intensity = amount of computation (e.g., instructions) / amount of communication (e.g., bytes)

-> The higher, the better: more computation is done per byte communicated.
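As a small worked example of the ratio, the sketch below computes arithmetic intensity for an element-wise vector add. The function name and the assumption of 4-byte floats with no cache reuse are illustrative choices.

```python
def arithmetic_intensity(operations, bytes_moved):
    """Computation per unit of communication (here: operations per byte)."""
    return operations / bytes_moved


# c[i] = a[i] + b[i] on 4-byte floats (assumed): one add per element,
# two 4-byte loads and one 4-byte store, so 12 bytes moved per operation.
vector_add = arithmetic_intensity(operations=1, bytes_moved=12)

# A hypothetical kernel that performs 8 operations on the same 12 bytes
# has 8x the arithmetic intensity and is less likely to be bandwidth-bound.
denser_kernel = arithmetic_intensity(operations=8, bytes_moved=12)
```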

Inherent communication

image

image

The grid on the right has the higher communication cost.

  • Inherent communication : information that fundamentally must be moved between processors to carry out the algorithm given the specified assignment
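  • Artifactual communication : all other communication (it comes from how the system is implemented, e.g., cache-line granularity or limited cache capacity, not from the algorithm itself)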
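To show how the assignment alone fixes the inherent communication, the sketch below assumes a grid solver in which updating a row needs the rows directly above and below it. It counts how many neighbor rows live on a different processor under two assignments. Both assignments and all names are assumptions for illustration and are not taken from the figures.

```python
def rows_communicated(owner, n_rows):
    """Count neighbor rows that must be fetched from another processor.

    `owner` maps a row index to the processor that computes that row.
    """
    count = 0
    for row in range(n_rows):
        for neighbor in (row - 1, row + 1):
            if 0 <= neighbor < n_rows and owner(neighbor) != owner(row):
                count += 1
    return count


N_ROWS, N_PROCS = 12, 3

# Blocked assignment: each processor owns a contiguous band of rows.
blocked = rows_communicated(lambda r: r // (N_ROWS // N_PROCS), N_ROWS)

# Interleaved assignment: rows are dealt out round-robin.
interleaved = rows_communicated(lambda r: r % N_PROCS, N_ROWS)

print(blocked, interleaved)  # 4 22
```

With the blocked assignment only the rows on a band boundary need remote data, while the interleaved assignment makes every neighbor row remote, even though both perform the same amount of computation.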

image

Three cache lines must be loaded for every 4 elements computed.

image

If the grid is instead traversed in the order shown, only two cache lines are loaded for every 6 elements computed.

In other words, make good use of spatial locality: get as much work as possible out of every cache line that is loaded.
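The effect of traversal order can be reproduced with a toy cache simulation. This is a minimal sketch under assumed parameters: a 64x64 row-major grid, a 5-point stencil, 4-element cache lines, and a 16-line fully associative LRU cache. These are not the parameters of the figures, so the exact counts differ from the ratios quoted above.

```python
from collections import OrderedDict


def count_line_loads(order, n, line_elems=4, capacity_lines=16):
    """Count cache-line loads for a 5-point stencil visited in `order`."""
    cache = OrderedDict()  # keys are line ids, ordered from least to most recent
    loads = 0
    for i, j in order:
        for r, c in ((i - 1, j), (i, j - 1), (i, j), (i, j + 1), (i + 1, j)):
            line = (r * n + c) // line_elems  # row-major layout
            if line in cache:
                cache.move_to_end(line)       # hit: mark as most recently used
            else:
                loads += 1                    # miss: load the line
                cache[line] = True
                if len(cache) > capacity_lines:
                    cache.popitem(last=False)  # evict the least recently used
    return loads


def row_major_order(n):
    """Visit interior elements one full row at a time."""
    return [(i, j) for i in range(1, n - 1) for j in range(1, n - 1)]


def blocked_order(n, block_cols=8):
    """Visit interior elements one narrow column band at a time."""
    order = []
    for start in range(0, n, block_cols):
        for i in range(1, n - 1):
            for j in range(max(start, 1), min(start + block_cols, n - 1)):
                order.append((i, j))
    return order


N = 64
row_major_loads = count_line_loads(row_major_order(N), N)
blocked_loads = count_line_loads(blocked_order(N), N)
print(row_major_loads, blocked_loads)
```

In the row-major order a full row does not fit in the assumed cache, so the lines of the rows above and below are evicted before they are needed again. The blocked order returns to them while they are still resident, so it loads fewer lines for the same output.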

Contention

image

  • Flat communication
  • Tree structured communication
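The difference between the two structures can be seen by counting how many messages the busiest node receives during a sum reduction. The sketch below simulates both patterns sequentially. The function names and the pairwise combining scheme are assumptions chosen for illustration.

```python
def flat_reduce(values):
    """Every node sends its value directly to node 0."""
    total = values[0]
    messages_to_root = 0
    for value in values[1:]:
        total += value
        messages_to_root += 1
    return total, messages_to_root


def tree_reduce(values):
    """Nodes combine pairwise in rounds, doubling the stride each round."""
    vals = list(values)
    received = [0] * len(vals)
    stride, rounds = 1, 0
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] += vals[i + stride]  # node i receives from node i + stride
            received[i] += 1
        stride *= 2
        rounds += 1
    return vals[0], max(received), rounds


data = list(range(1, 9))   # 8 nodes holding the values 1..8
print(flat_reduce(data))   # (36, 7): the root receives 7 messages
print(tree_reduce(data))   # (36, 3, 3): at most 3 messages per node, 3 rounds
```

The flat pattern concentrates all P - 1 messages on one node, which is where contention appears. The tree spreads them out so that no node receives more than about log2(P) messages, at the cost of extra rounds.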

Reducing communication costs

  • Reduce overhead of communication to sender/receiver
  • Reduce latency of communication
  • Reduce contention
  • Increase communication/computation overlap
    • asynchronous communication, pipelining, multi-threading, pre-fetching, out-of-order execution
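As one way to picture communication/computation overlap, the sketch below prefetches the next chunk on a background thread while the current chunk is being processed. `fetch_chunk` is a hypothetical stand-in for a real communication step (for example a remote read), and the chunk size of 4 is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(index):
    """Hypothetical stand-in for communication: returns 4 consecutive integers."""
    return list(range(index * 4, index * 4 + 4))


def process_with_prefetch(num_chunks):
    """Sum all chunks, requesting chunk i + 1 before computing on chunk i."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)
        for index in range(num_chunks):
            chunk = pending.result()           # wait for the current chunk
            if index + 1 < num_chunks:
                pending = pool.submit(fetch_chunk, index + 1)  # start next fetch
            total += sum(chunk)                # compute while the fetch runs
    return total


print(process_with_prefetch(3))  # 66, the sum of 0..11
```

The fetch for the next chunk is issued before the computation on the current one, so with a real, slow communication step the two would overlap and part of the communication latency would be hidden.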