von Neumann architecture: shared memory banks + buses for instructions and data
Harvard architecture: separate memory banks + buses for instructions and data
CPU-Memory Gap: the growing difference between CPU speed and DRAM/disk speeds
Runtime of a program is determined by instruction count, CPI, and clock cycle time (see the CPU time formula below).
Performance is largely measured through throughput, latency, and clock speed.
CPI = clock cycles per instruction = average CPI = effective CPI
CPU time = CPI $\times$ instruction count $\times$ clock cycle
power wall: practical limit on processor power consumption (and heat dissipation) that prevents clock speeds from scaling further
The memory hierarchy was introduced to tackle the memory wall and reduce the effects of the Processor-Memory gap. Also:
temporal locality: recently accessed items are likely to be accessed again in the near future
spatial locality: recently accessed items are likely to have nearby items accessed in the near future
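A small C illustration of both kinds of locality (the array and function names are mine): the loop reuses sum on every iteration (temporal locality) and walks the array in address order (spatial locality); swapping the two loops would stride N ints per access and lose the spatial locality.

```c
#define N 1024

/* Row-major traversal: consecutive addresses (spatial locality);
 * sum is accessed every iteration (temporal locality). */
long sum_array(int a[N][N]) {
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```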
CPU looks in L1 cache, then L2, then L3, then main memory.
NOTE: despite the numbering, L3 is a “lower level cache” than L2, which is a lower level cache than L1.
placement (mapping) policy: where should a newly loaded block go in the cache?
replacement policy: which block should be evicted from the cache (to make room for a new block)?
cold miss (at level $k$): occurs when a block is missing for the first time at level $k$; these are unavoidable
conflict miss (at level $k$): the mapping policy dictates where blocks can go in the cache; a conflict miss occurs when multiple data items from level $k + 1$ map to the same position in level $k$
capacity miss (at level $k$): a miss that occurs because the cache is full; unlike conflict misses, these would occur even if the cache were fully associative with LRU replacement
With caching, the CPU time formula changes:
CPU Time = instruction count $\times$ clock cycle $\times$ (CPI + average memory stall cycles)
Average memory stall cycles = access rate (memory accesses per instruction) $\times$ miss rate $\times$ miss penalty
(CPI + average memory stall cycles) is called $\text{CPI}_{\text{stall}}$; the CPI inside $\text{CPI}_{\text{stall}}$ is also called $\text{CPI}_{\text{ideal}}$.
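A quick worked example with made-up numbers: suppose $\text{CPI}_{\text{ideal}} = 1$, an access rate of 1.3 memory accesses per instruction (one fetch plus 0.3 data accesses), a 5% miss rate, and a 20-cycle miss penalty. Then
\[\text{CPI}_{\text{stall}} = 1 + 1.3 \times 0.05 \times 20 = 2.3\]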
banked cache: cache divided into two sections, one for instructions and one for data; L1 is usually banked
unified cache: cache where instructions and data are stored together
average memory access time (AMAT): average time to access memory considering both hits and misses
AMAT = hit time + miss penalty $\times$ miss rate
For multiple cache levels:
AMAT = L1 hit time + L1 miss rate $\times$ L1 miss penalty
L1 miss penalty = L2 hit time + L2 miss rate $\times$ L2 miss penalty
…
L3 miss penalty = Main memory hit time
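A worked example with made-up numbers, assuming just two cache levels: L1 hit time 1 cycle, L1 miss rate 5%, L2 hit time 10 cycles, L2 miss rate 20%, main memory hit time 100 cycles:
\[\text{AMAT} = 1 + 0.05 \times (10 + 0.2 \times 100) = 2.5 \text{ cycles}\]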
Hierarchy management: registers are managed by the compiler, caches by hardware, and main memory by the OS (virtual memory).
The cache is split into $S$ sets, with $B$ blocks in each. Each block has a tag field used to differentiate between different items currently loaded in the cache.
A memory address is decomposed (from most to least significant bits) as tag | set index | block offset
Addresses are assumed to be byte addresses (not word addresses).
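A minimal C sketch of this decomposition, assuming a hypothetical cache with 64 sets (6 index bits) and 32-byte blocks (5 offset bits):

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5   /* 32-byte blocks (assumed geometry) */
#define INDEX_BITS  6   /* 64 sets (assumed geometry) */

int main(void) {
    uint32_t addr   = 0x12345678;  /* a byte address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x set=%u offset=%u\n", tag, set, offset);
    return 0;
}
```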
direct-mapped: one block per set; each memory address maps to exactly one line in the cache; a tag is still stored, but no replacement policy is needed (there is only one candidate line)
fully associative: one set; a memory address can be mapped to any block; the tag is the whole address except the block offset
N-way set associative: N blocks per set; a memory address maps to exactly one set but can occupy any of that set's N blocks
Read hit: do nothing
Write hit policies (data only): write-through (write to both the cache and the next level) or write-back (write only to the cache, mark the block dirty, and write it to the next level on eviction)
Read miss policies: stall execution, fetch the block from the next level into the cache, then restart the access
Write miss policies (data only): stall execution and either write-allocate (fetch the block into the cache, then write) or no-write-allocate (write directly to the next level without loading the block)
Reducing hit time: keep the cache small and simple (e.g., a small, low-associativity L1)
Reducing miss rate: larger caches, higher associativity, larger block sizes
Reducing miss penalty: add more cache levels; resume execution as soon as the requested word arrives (early restart)
conjunctive normal form (CNF): conjunction of disjunctions (an AND of OR-clauses)
disjunctive normal form (DNF): disjunction of conjunctions (an OR of AND-terms)
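For example, $(a + \bar b)(\bar a + c)$ is in CNF, while $ab + \bar a c$ is in DNF (writing OR as $+$ and AND as juxtaposition, as in the MUX formula below).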
A set of operators is functionally complete if it is enough to express any operation in Boolean algebra; e.g., {AND, OR, NOT} and {NAND} are each functionally complete.
combinational circuits = stateless circuits = functional blocks
block (schematic) diagram: specifies inputs, outputs, number of bits for each, and formula/truth table
1-bit half-adder: XOR gate for sum, AND for carry
1-bit full-adder: adds $a$, $b$, and a carry-in $c_{in}$: sum $= a \oplus b \oplus c_{in}$, carry-out $= ab + c_{in}(a \oplus b)$
n-bit full-adder: a chain of $n$ 1-bit full adders with each carry-out feeding the next carry-in (ripple-carry adder); see the sketch below
n-bit subtractor: an n-bit adder with the second operand inverted and the initial carry-in set to 1, since $a - b = a + \bar b + 1$ in two's complement
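A minimal bit-level sketch in C of the ripple-carry adder and two's-complement subtraction described above (function names are mine):

```c
#include <stdint.h>
#include <stdio.h>

/* 1-bit full adder: returns the sum bit, writes carry-out through *cout. */
static unsigned full_add(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
    *cout = (a & b) | (cin & (a ^ b));
    return a ^ b ^ cin;
}

/* n-bit ripple-carry adder: each carry-out feeds the next carry-in. */
static uint32_t ripple_add(uint32_t a, uint32_t b, unsigned cin, int n) {
    uint32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum |= (uint32_t)full_add((a >> i) & 1, (b >> i) & 1, cin, &cin) << i;
    return sum;
}

int main(void) {
    /* Subtraction: a - b = a + ~b + 1 (invert b, set carry-in to 1). */
    printf("%u\n", ripple_add(7, ~3u, 1, 32));  /* prints 4 */
    return 0;
}
```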
multiplexer: select between multiple inputs with control signal $c$:
\[MUX(a, b, c) = a \bar c + b c\]
demultiplexer: routes its single input to one of multiple outputs, chosen by a control signal
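The same formula as a one-bit C expression (a sketch; inputs assumed to be 0 or 1):

```c
/* 2-to-1 mux on single bits: returns a when c == 0, b when c == 1. */
unsigned mux(unsigned a, unsigned b, unsigned c) {
    return ((a & ~c) | (b & c)) & 1u;
}
```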
delay flip-flop: state takes on the input value at the next clock edge (the output is the input delayed by one cycle)
$D$ | $Q_{next}$ |
---|---|
0 | 0 |
1 | 1 |
toggle flip-flop: toggles the state if $T = 1$, otherwise holds it
SR flip-flop: $S = 1$ sets the state to 1, $R = 1$ resets it to 0; both 0 holds the state; both 1 is not allowed (undefined)
JK flip-flop: an SR flip-flop, except setting both inputs to 1 toggles the state
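The next-state logic for these flip-flops as one-bit C functions (a sketch; inputs assumed to be 0 or 1):

```c
/* q is the current state; each function returns the next state. */
unsigned d_next(unsigned d)                          { return d; }
unsigned t_next(unsigned t, unsigned q)              { return t ? !q : q; }
unsigned sr_next(unsigned s, unsigned r, unsigned q) { return s | (!r & q); } /* s = r = 1 disallowed */
unsigned jk_next(unsigned j, unsigned k, unsigned q) { return (j & !q) | (!k & q); } /* j = k = 1 toggles */
```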
parallel-in parallel-out: all bits written and all bits read at once (an ordinary register)
serial-in parallel-out: one bit shifted in per clock cycle, all bits read at once
serial-in serial-out: one input bit, one output bit
parallel-in serial-out: one bit output at a time, bits shifted over on each output; requires additional control signal for write/shift operation
clk-to-q delay: propagation delay of a flip-flop
setup time: stable signal value to the input of a flip-flop required before the rising edge of the clock
hold time: stable signal value to the input of a flip-flop required after the rising edge of the clock
clocked accumulator: a register whose output feeds back into an adder along with a new input each clock cycle, accumulating a running sum
minimum clock period = combinational circuit delay + clk-to-q delay + setup time
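For example (made-up numbers): a 4 ns combinational delay, a 1 ns clk-to-q delay, and a 0.5 ns setup time give a minimum clock period of $4 + 1 + 0.5 = 5.5$ ns, i.e., a maximum clock rate of roughly 182 MHz.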
Mealy machine: FSM whose output depends on current state and current input
MIPS ISA is a RISC architecture with pipelining and without interlocking (MIPS: Microprocessor without Interlocked Pipeline Stages).
MIPS is big-endian.
32 32-bit registers
R-type (register) instructions
opcode | source 1 | source 2 | destination | shift amount | function type |
---|---|---|---|---|---|
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |
For shift instructions, shamt (shift amount) is used and source 1 is always 0; e.g., sll $s0, $t0, 4 shifts $t0 left by 4 bits and stores the result in $s0.
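A minimal C sketch of extracting the R-type fields from a 32-bit instruction word (struct and function names are mine; field positions follow the table above):

```c
#include <stdint.h>

struct rtype {
    unsigned opcode, rs, rt, rd, shamt, funct;
};

struct rtype decode_rtype(uint32_t instr) {
    struct rtype r;
    r.opcode = (instr >> 26) & 0x3F;  /* 6 bits */
    r.rs     = (instr >> 21) & 0x1F;  /* 5 bits: source 1 */
    r.rt     = (instr >> 16) & 0x1F;  /* 5 bits: source 2 */
    r.rd     = (instr >> 11) & 0x1F;  /* 5 bits: destination */
    r.shamt  = (instr >> 6)  & 0x1F;  /* 5 bits: shift amount */
    r.funct  = instr & 0x3F;          /* 6 bits: function type */
    return r;
}
```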
I-type (immediate) instructions
opcode | source 1 | source 2 (or destination) | immediate |
---|---|---|---|
6 bits | 5 bits | 5 bits | 16 bits |
e.g., in sw $t1, 32($t0), $t1 holds the data being stored and $t0 holds the base address (the immediate 32 is the offset).
J-type (jump) instructions
opcode | target (jump address) |
---|---|
6 bits | 26 bits |
Used by the j and jal instructions; target is multiplied by 4 and OR-ed with the upper 4 bits of the PC to form the new PC.
In a single-cycle datapath, the minimum clock cycle needs to be long enough to run all 5 stages in one clock cycle.
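A short C sketch of that new-PC computation (function name is mine):

```c
#include <stdint.h>

/* J-type: target is shifted left 2 (multiplied by 4) and
 * OR-ed with the upper 4 bits of the PC. */
uint32_t jump_target(uint32_t pc, uint32_t target26) {
    return (pc & 0xF0000000u) | (target26 << 2);
}
```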
Multiplexers needed in datapath for: selecting the next PC (PC + 4 vs. branch/jump target), the second ALU operand (register vs. sign-extended immediate), the destination register field (rt vs. rd), and the write-back value (ALU result vs. memory data)