Parallel Programming 7
Memory Consistency
[๊ทธ๋ฆผ1]
x=0, y=0์ธ ๊ฒฝ์ฐ๊ฐ ์กด์ฌํจ!
Reordering of memory operations to different addresses
Memory coherence vs Memory consistency
- Memory coherence : requirements for the observed bebavior of reads and writes to the same memory location
- Memory consistency : the behavior of reads and writes to different locations (as observed by other processor)
ย
3 classes of memory consistency models
- Sequential Consistency (SC) - MIPS, PA-RISC
- Total Store Order (TSO) - x86, SPARC
- Release Consistency (RC) - ARM, Itanium, PowerPC
ย
Sequential Consistency MIPS, PA-RISC
Out-of-Order with SC : ์์๊ฐ ๋ฐ๋ instruction์ด ๋ค์ด์ค๋ฉด invalidation ์ํค๊ณ , ์ด์ instruction์ ์คํํ๋ค.
ย
Total Store Ordering x86, SPARC
Reason : hiding store miss latency using write buffer
[๊ทธ๋ฆผ2]
Allows reads to proceed in front of prior stores
- When store is issued, processor buffers store in write buffer
- Write buffering relaxes W->R ordering (W->R ๊ท์น์ ์ํํ๋ค. ์ฆ, ๋ฌด์ํ๋ค)
ย
Partial Store Ordering ARM, PowerPC
[๊ทธ๋ฆผ3]
Thread 2์์ flag = 1๋ก ๋ฐ๋์ด์ loop๋ฅผ ๋น ์ ธ๋์๋ a ๊ฐ์ด 0์ผ ์ ์๋ค. ( initial state : 0 )
Reason : simplifying out-of-order execution, allow compiler optimizations
(allow compiler optimizations example)
ย
TSO, PSO์์ [๊ทธ๋ฆผ1]์ ์์๊ฐ ๋ฐ์ํ ์ ์๋ค
PSO์์ [๊ทธ๋ฆผ 3]์ ์์๊ฐ ๋ฐ์ํ ์ ์๋ค
4๊ฐ์ง ์์๋ฅผ ๋ชจ๋ ๋ฌด์ํ๋ฉด performance๊ฐ ์ฆ๊ฐํ๋ค. ํ์ง๋ง synchronization๋ฌธ์ ๊ฐ ๋ฐ์!
Unsynchronized programs contain data races. The output of the program depends on relative speed of processors
fence (memory barrier)๋ฅผ ์ฌ์ฉํ๋ฉด ํด๊ฒฐ ๊ฐ๋ฅํ์ง๋ง expensiveํ๋ค.
Summary
Motivation : obtain higher performance by allowing reordering of memory operations(SW complexity ์ฆ๊ฐ)
ย
Implementing locks
Desirable lock performance characteristics
- Low latency
- If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
- Low interconnect traffic
- If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
- Scalability
- Latency/ traffic should scale reasonably with number of processors
- Low storage cost
- Fairness
- Avoid starvation or substantial unfairness
- processors should acquire lock in the order they request access to it
Test-and-Set lock
- Repeated stores by multiple processors costly
- Generating a ton of useless interconnect traffic
Test-and-test-and-set lock
- Slightly higher latency than test-and-set in uncontended case (test๊ณผ์ ์ด 2๋ฒ์ด๋ ์๋ค)
- Generates much less interconnect traffic
- Still no provisions for fairness
Test-and-set lock with back off
- Same uncontended latency as test-and-set, but potentially higher latency under contention
- Exponential back-off can cause severe unfairness
- Newer requests back off for shorter intervals
Ticker lock
Main problem with test-and-set style locks : upon release, all waiting processors attempt to acquire lock using test-and-set
- No atomic operation needed to acquire the lock
- Only one invalidation per lock release