Parallel Programming 7

November 22, 2021

Memory Consistency

[그림1]

x=0, y=0인 경우가 존재함!

Reordering of memory operations to different addresses

Memory coherence vs Memory consistency

Memory coherence : requirements for the observed bebavior of reads and writes to the same memory location
Memory consistency : the behavior of reads and writes to different locations (as observed by other processor)

3 classes of memory consistency models

Sequential Consistency (SC) - MIPS, PA-RISC
Total Store Order (TSO) - x86, SPARC
Release Consistency (RC) - ARM, Itanium, PowerPC

Sequential Consistency MIPS, PA-RISC

Out-of-Order with SC : 순서가 바뀐 instruction이 들어오면 invalidation 시키고, 이전 instruction을 실행한다.

Total Store Ordering x86, SPARC

Reason : hiding store miss latency using write buffer

[그림2]

Allows reads to proceed in front of prior stores

When store is issued, processor buffers store in write buffer
Write buffering relaxes W->R ordering (W->R 규칙을 완화한다. 즉, 무시한다)

Partial Store Ordering ARM, PowerPC

[그림3]

Thread 2에서 flag = 1로 바뀌어서 loop를 빠져나와도 a 값이 0일 수 있다. ( initial state : 0 )

Reason : simplifying out-of-order execution, allow compiler optimizations

(allow compiler optimizations example)

TSO, PSO에서 [그림1]의 예시가 발생할 수 있다

PSO에서 [그림 3]의 예시가 발생할 수 있다

4가지 순서를 모두 무시하면 performance가 증가한다. 하지만 synchronization문제가 발생!

Unsynchronized programs contain data races. The output of the program depends on relative speed of processors

fence (memory barrier)를 사용하면 해결 가능하지만 expensive하다.

Summary

Motivation : obtain higher performance by allowing reordering of memory operations(SW complexity 증가)

Implementing locks

Desirable lock performance characteristics

Low latency
- If lock is free and no other processors are trying to acquire it, a processor should be able to acquire the lock quickly
Low interconnect traffic
- If all processors are trying to acquire lock at once, they should acquire the lock in succession with as little traffic as possible
Scalability
- Latency/ traffic should scale reasonably with number of processors
Low storage cost
Fairness
- Avoid starvation or substantial unfairness
- processors should acquire lock in the order they request access to it

Test-and-Set lock

Repeated stores by multiple processors costly
Generating a ton of useless interconnect traffic

Test-and-test-and-set lock

Slightly higher latency than test-and-set in uncontended case (test과정이 2번이나 있다)
Generates much less interconnect traffic
Still no provisions for fairness

Test-and-set lock with back off

Same uncontended latency as test-and-set, but potentially higher latency under contention
Exponential back-off can cause severe unfairness
- Newer requests back off for shorter intervals

Ticker lock

Main problem with test-and-set style locks : upon release, all waiting processors attempt to acquire lock using test-and-set

No atomic operation needed to acquire the lock
Only one invalidation per lock release

Lee

Parallel Programming 7

Memory Consistency

Memory coherence vs Memory consistency

3 classes of memory consistency models

Summary

Implementing locks

You may also enjoy

Parallel Programming 8

Parallel Programming 6

Parallel Programming 5

Parallel Programming 4