Parallel Programming 7

Memory Consistency

image

[๊ทธ๋ฆผ1]

x=0, y=0์ธ ๊ฒฝ์šฐ๊ฐ€ ์กด์žฌํ•จ!

Reordering of memory operations to different addresses

Memory coherence vs Memory consistency

  • Memory coherence : requirements for the observed bebavior of reads and writes to the same memory location
  • Memory consistency : the behavior of reads and writes to different locations (as observed by other processor)

ย 

3 classes of memory consistency models

  • Sequential Consistency (SC) - MIPS, PA-RISC
  • Total Store Order (TSO) - x86, SPARC
  • Release Consistency (RC) - ARM, Itanium, PowerPC

ย 

Sequential Consistency MIPS, PA-RISC

image

Out-of-Order with SC : ์ˆœ์„œ๊ฐ€ ๋ฐ”๋€ instruction์ด ๋“ค์–ด์˜ค๋ฉด invalidation ์‹œํ‚ค๊ณ , ์ด์ „ instruction์„ ์‹คํ–‰ํ•œ๋‹ค.

ย 

Total Store Ordering x86, SPARC

image

Reason : hiding store miss latency using write buffer

image

[๊ทธ๋ฆผ2]

Allows reads to proceed in front of prior stores

  • When store is issued, processor buffers store in write buffer
  • Write buffering relaxes W->R ordering (W->R ๊ทœ์น™์„ ์™„ํ™”ํ•œ๋‹ค. ์ฆ‰, ๋ฌด์‹œํ•œ๋‹ค)

ย 

Partial Store Ordering ARM, PowerPC

image

image

[๊ทธ๋ฆผ3]

Thread 2์—์„œ flag = 1๋กœ ๋ฐ”๋€Œ์–ด์„œ loop๋ฅผ ๋น ์ ธ๋‚˜์™€๋„ a ๊ฐ’์ด 0์ผ ์ˆ˜ ์žˆ๋‹ค. ( initial state : 0 )

Reason : simplifying out-of-order execution, allow compiler optimizations

image

(allow compiler optimizations example)

ย 

TSO, PSO์—์„œ [๊ทธ๋ฆผ1]์˜ ์˜ˆ์‹œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค

PSO์—์„œ [๊ทธ๋ฆผ 3]์˜ ์˜ˆ์‹œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค

4๊ฐ€์ง€ ์ˆœ์„œ๋ฅผ ๋ชจ๋‘ ๋ฌด์‹œํ•˜๋ฉด performance๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค. ํ•˜์ง€๋งŒ synchronization๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒ!

Unsynchronized programs contain data races. The output of the program depends on relative speed of processors

fence (memory barrier)๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํ•ด๊ฒฐ ๊ฐ€๋Šฅํ•˜์ง€๋งŒ expensiveํ•˜๋‹ค.

Summary

Motivation : obtain higher performance by allowing reordering of memory operations(SW complexity ์ฆ๊ฐ€)

ย 

Implementing locks

Desirable lock performance characteristics

  • Low latency
    • If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
  • Low interconnect traffic
    • If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
  • Scalability
    • Latency/ traffic should scale reasonably with number of processors
  • Low storage cost
  • Fairness
    • Avoid starvation or substantial unfairness
    • processors should acquire lock in the order they request access to it

Test-and-Set lock

  • Repeated stores by multiple processors costly
  • Generating a ton of useless interconnect traffic

Test-and-test-and-set lock

  • Slightly higher latency than test-and-set in uncontended case (test๊ณผ์ •์ด 2๋ฒˆ์ด๋‚˜ ์žˆ๋‹ค)
  • Generates much less interconnect traffic
  • Still no provisions for fairness

Test-and-set lock with back off

  • Same uncontended latency as test-and-set, but potentially higher latency under contention
  • Exponential back-off can cause severe unfairness
    • Newer requests back off for shorter intervals

Ticker lock

Main problem with test-and-set style locks : upon release, all waiting processors attempt to acquire lock using test-and-set

  • No atomic operation needed to acquire the lock
  • Only one invalidation per lock release