Computer Architecture

Introduction

von Neumann architecture: shared memory banks + buses for instructions and data

Harvard architecture: separate memory banks + buses for instructions and data

CPU-Memory Gap: the growing difference between CPU speed and DRAM/disk speeds

Eight Great Ideas in Computer Architecture

  1. Use abstraction to simplify design
  2. Design for Moore’s Law
  3. Make the common case fast
  4. Performance via parallelism
  5. Performance via pipelining
  6. Performance via prediction
  7. Hierarchy of memories
  8. Dependability via redundancy

CPU

Runtime of a program determined by:

  1. CPU clock speed
  2. Type of instructions performed (e.g. multiplication/division take longer than addition/subtraction)
  3. Memory speed (access time)

Performance largely measured through throughput, latency, and clock speed.

CPI

CPI = clock cycles per instruction, averaged over the executed instruction mix; also called average CPI or effective CPI

CPU time = CPI $\times$ instruction count $\times$ clock cycle
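The formula can be checked with a small calculation (the numbers below are made up for illustration):

```python
# Hypothetical program: 10 billion instructions on a 2 GHz CPU
# with an average CPI of 1.5.
instruction_count = 10e9
cpi = 1.5
clock_cycle = 1 / 2e9  # seconds per cycle at 2 GHz

cpu_time = cpi * instruction_count * clock_cycle
print(cpu_time)  # 7.5 seconds
```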

Power

power wall: the practical limit on processor power consumption (and heat dissipation) that prevents clock speeds from scaling further

Memory Hierarchy

The memory hierarchy was introduced to tackle the memory wall and reduce the effects of the CPU-Memory gap. It works because programs exhibit locality:

Principle of Locality

temporal locality: recently accessed items are likely to be accessed again in the near future

spatial locality: recently accessed items are likely to have nearby items accessed in the near future
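A short loop illustrates both kinds of locality (conceptually; a C array maps to memory more directly than a Python list, whose elements are references):

```python
# Summing an array front-to-back:
# - spatial locality: consecutive elements sit next to each other in
#   memory, so a cache block fetched for data[i] likely also holds data[i+1]
# - temporal locality: `total` and the loop index are reused on every iteration
data = list(range(1000))
total = 0
for i in range(len(data)):
    total += data[i]
print(total)  # 499500
```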

Cache Hierarchy

CPU looks in L1 cache, then L2, then L3, then main memory.

NOTE: despite the numbering, L3 is a “lower level cache” than L2, which is a lower level cache than L1.

placement (mapping) policy: where should a new block being loaded into the cache go?

replacement policy: which block should be evicted from the cache (to make room for a new block)?

cold miss (at level $k$): occurs when a block is missing for the first time at level $k$; these are unavoidable

conflict miss (at level $k$): mapping policies dictate where blocks can go in the cache: conflict misses occur when multiple data items from level $k + 1$ map to the same position in level $k$

capacity miss (at level $k$): a miss that occurs because the cache is too small to hold the program's working set; these would occur even if the cache were fully-associative with LRU replacement

Cache and Memory Performance

With caching, the CPU time formula changes:

CPU Time = instruction count $\times$ clock cycle $\times$ (CPI + average memory stall cycles)

Average memory stall cycles = memory accesses per instruction $\times$ miss rate $\times$ miss penalty

(CPI + average memory stall cycles) is called CPI$_{stall}$; the base CPI inside CPI$_{stall}$ is also called CPI$_{ideal}$.
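Plugging in some assumed numbers (all hypothetical) shows how memory stalls inflate the effective CPI:

```python
# Hypothetical machine: CPI_ideal = 1.2, with 1.3 memory accesses per
# instruction, a 5% miss rate, and a 100-cycle miss penalty.
cpi_ideal = 1.2
accesses_per_instruction = 1.3
miss_rate = 0.05
miss_penalty = 100  # cycles

stall_cycles = accesses_per_instruction * miss_rate * miss_penalty  # 6.5
cpi_stall = cpi_ideal + stall_cycles  # ~7.7: memory dominates runtime
print(cpi_stall)
```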

banked cache: cache divided into two sections, one for instructions and one for data; L1 is usually a banked cache

unified cache: cache where instructions and data are stored together

average memory access time (AMAT): average time to access memory considering both hits and misses

AMAT = hit time + miss penalty $\times$ miss rate

For multiple cache levels:

AMAT = L1 hit time + L1 miss rate $\times$ L1 miss penalty

L1 miss penalty = L2 hit time + L2 miss rate $\times$ L2 miss penalty

L2 miss penalty = L3 hit time + L3 miss rate $\times$ L3 miss penalty

L3 miss penalty = main memory access time
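The recursion expands bottom-up; with assumed timings for a two-level hierarchy (numbers are illustrative only):

```python
# Hypothetical two-level hierarchy: the L1 miss penalty is itself the
# average time to access L2 and, on an L2 miss, main memory.
l1_hit_time, l1_miss_rate = 1, 0.05   # cycles, fraction of accesses
l2_hit_time, l2_miss_rate = 10, 0.20
main_memory_time = 100                # cycles

l2_miss_penalty = main_memory_time
l1_miss_penalty = l2_hit_time + l2_miss_rate * l2_miss_penalty  # 30 cycles
amat = l1_hit_time + l1_miss_rate * l1_miss_penalty             # 2.5 cycles
print(amat)
```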

Cache Implementation

Hierarchy management:

The cache is split into $S$ sets, with $B$ blocks in each. Each block has a tag field used to differentiate between different items currently loaded in the cache.

A memory address is decomposed as the tag | set index | block offset

Addresses are assumed to be byte addresses (not word addresses).
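The decomposition can be sketched with bit operations; the geometry below (64 sets, 16-byte blocks) is an assumption for illustration:

```python
# Decompose a byte address into tag | set index | block offset.
# Assumed geometry: 64 sets and 16-byte blocks, i.e. 6 index bits
# and 4 offset bits.
def decompose(address, num_sets=64, block_size=16):
    offset_bits = block_size.bit_length() - 1  # log2(block_size)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    offset = address & (block_size - 1)            # low bits
    index = (address >> offset_bits) & (num_sets - 1)  # middle bits
    tag = address >> (offset_bits + index_bits)        # remaining high bits
    return tag, index, offset

print(decompose(0x1234ABCD))
```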

Cache Organization Schemes

direct-mapped: one block per set; each memory address mapped to exactly one line in the cache; a tag is still needed to identify which block currently occupies the line

fully associative: one set: memory address can be mapped to any block; tag is whole address except block offset

N-way set associative: N blocks per set; a memory address maps to exactly one set but may occupy any of the N blocks within it

Set Associativity Cost

Handling Cache Hits and Misses

Read hit: do nothing

Write hit policies (data only)

  1. write-through: maintain consistency between cache and main memory by writing to cache AND main memory; amortize performance by using a write buffer
  2. write-back: allow inconsistency, marking blocks as dirty when written to. Only write back to main memory on eviction (requires 2 cycles: 1 to check the dirty bit and 1 to write back; alternatively, use a write buffer for just 1 cycle)

Read miss policies:

  1. stall execution, fetch block from lower cache level, install into higher cache level, and resume

Write miss policies (data only): stall execution and

  1. write-allocate: fetch block from lower cache, write updated value, and install in cache
  2. no write-allocate: write to lower cache without installing into cache (or use a write buffer)

Cache Design

Reducing hit time:

Reducing miss rate:

Reducing miss penalty:

Boolean Algebra

conjunctive normal form (CNF): conjunction of disjunctions

disjunctive normal form (DNF): disjunction of conjunctions

A set of operators is functionally complete if it is enough to describe any operation in boolean algebra

Stateless Circuits

combinational circuits = stateless circuits = functional blocks

block (schematic) diagram: specifies inputs, outputs, number of bits for each, and formula/truth table

1-bit half-adder: XOR gate for sum, AND for carry

1-bit full-adder:

n-bit full-adder:

n-bit subtractor:
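The adder blocks above can be sketched at gate level using bitwise operators (a sketch, not a hardware description):

```python
# Half-adder: XOR for sum, AND for carry.
def half_adder(a, b):
    return a ^ b, a & b  # (sum, carry)

# Full-adder: two half-adders plus an OR of the carries.
def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, cin)
    return s2, c1 | c2

# n-bit ripple-carry adder: chain full adders, LSB first.
def ripple_carry_add(a_bits, b_bits):
    result, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry

# 3 + 5 with 4-bit operands, LSB first: 0011 + 0101 = 1000
print(ripple_carry_add([1, 1, 0, 0], [1, 0, 1, 0]))  # ([0, 0, 0, 1], 0)
```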

multiplexer: select between multiple inputs with control signal $c$:

\[MUX(a, b, c) = a \bar c + b c\]
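The formula translates directly to bitwise operators (single-bit values assumed):

```python
# 2-to-1 multiplexer: c = 0 selects a, c = 1 selects b.
def mux(a, b, c):
    return (a & ~c & 1) | (b & c)

assert mux(0, 1, 0) == 0  # c = 0 -> a
assert mux(0, 1, 1) == 1  # c = 1 -> b
```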

demultiplexer: routes its single input to one of several outputs, chosen by a control signal

State Circuits

delay flip-flop: sets state to input after some delay

$D$   $Q_{next}$
0     0
1     1

toggle flip-flop: $T$ toggles state if $T \equiv 1$, otherwise does nothing

SR flip-flop:

JK flip-flop: SR flip-flop except both set to 1 toggles the state

parallel-in parallel-out:

serial-in parallel-out:

serial-in serial-out: one input bit, one output bit

parallel-in serial-out: one bit output at a time, bits shifted over on each output; requires additional control signal for write/shift operation

clk-to-q delay: propagation delay of a flip-flop

setup time: stable signal value to the input of a flip-flop required before the rising edge of the clock

hold time: stable signal value to the input of a flip-flop required after the rising edge of the clock

clocked accumulator:

minimum clock period = combinational circuit delay + clk-to-q delay + setup time
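With some assumed delays (hypothetical values, in nanoseconds), the formula bounds the clock frequency:

```python
# Hypothetical timing parameters, in nanoseconds.
combinational_delay = 4.0
clk_to_q = 0.5
setup_time = 0.3

min_period = combinational_delay + clk_to_q + setup_time  # 4.8 ns
max_frequency = 1 / (min_period * 1e-9)  # highest safe clock rate, in Hz
print(min_period, max_frequency)
```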

Finite State Machines

Mealy machine: FSM whose output depends on current state and current input

  1. Generate truth table
  2. Simplify boolean expressions for next state and output bits
  3. Draw logic diagram
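A minimal Mealy machine sketch (an assumed example, not from the notes) is a rising-edge detector: the state is the previous input bit, and the output depends on both the current state and the current input, which is what distinguishes a Mealy machine:

```python
# Mealy machine: output 1 exactly when the previous input was 0 and
# the current input is 1 (a rising edge).
def edge_detector(bits):
    state, outputs = 0, []
    for x in bits:
        outputs.append(1 if (state == 0 and x == 1) else 0)  # output(state, input)
        state = x  # next-state function: remember the current input
    return outputs

print(edge_detector([0, 1, 1, 0, 1]))  # [0, 1, 0, 0, 1]
```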

Introduction to MIPS

MIPS ISA is a RISC architecture with pipelining and without interlocking.

MIPS is big-endian.

MIPS Registers

32 32-bit registers

MIPS Instruction Formats

R-type (register) instructions

opcode | source 1 (rs) | source 2 (rt) | destination (rd) | shift amount (shamt) | function (funct)
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits

I-type (immediate) instructions

opcode | source 1 (rs) | source 2 or destination (rt) | immediate
6 bits | 5 bits | 5 bits | 16 bits

J-type (jump) instructions

opcode | target (jump address)
6 bits | 26 bits
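As a sketch of how the R-type fields pack into a 32-bit word (the register numbers and the `add` funct code 0x20 are standard MIPS, not from these notes):

```python
# Pack R-type fields using the widths 6/5/5/5/5/6 bits.
def encode_r_type(opcode, rs, rt, rd, shamt, funct):
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $t1, $t2: opcode 0, rs = $t1 ($9), rt = $t2 ($10), rd = $t0 ($8)
word = encode_r_type(0, 9, 10, 8, 0, 0x20)
print(f"{word:08x}")  # 012a4020
```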

MIPS Single-Cycle Datapath

the minimum clock period must be long enough for an instruction to complete all 5 stages within a single clock cycle

Instruction Fetch

  1. Update value of PC on rising clock edge
  2. Fetch instruction from memory and pass to next stage
  3. Compute PC + 4

Instruction Decode

  1. Decompose instruction into bit segments
  2. Read opcode, determine R/I/J instruction type
  3. Access operand values from registers
  4. Extend to 32-bit immediate if needed

Execute

Memory Access

Write Back

Multiplexers needed in datapath for: