CPU Architecture

How processors execute instructions. From the von Neumann model through pipelining, superscalar execution, and the memory hierarchy that makes it all practical.

Architecture Models

Von Neumann Architecture

Proposed by John von Neumann in 1945 (building on Eckert and Mauchly's ENIAC work). The key insight: store both program instructions and data in the same memory. A single bus connects CPU and memory. This means instructions and data compete for the same memory bandwidth — the "von Neumann bottleneck." Nearly all modern general-purpose processors are fundamentally von Neumann machines, though with modifications.

Harvard Architecture

Uses separate memories (and buses) for instructions and data. Named after the Harvard Mark I (1944). Eliminates the von Neumann bottleneck by allowing simultaneous instruction fetch and data access. Used in DSPs (digital signal processors) and microcontrollers (PIC, AVR). Most modern CPUs use a "modified Harvard" approach: separate L1 instruction and data caches (Harvard at L1), but unified main memory (von Neumann at DRAM).

RISC vs CISC

CISC (Complex Instruction Set Computer) — variable-length instructions, many addressing modes, instructions can directly operate on memory. x86 is the canonical example. Complex instructions are decoded into micro-ops internally.
RISC (Reduced Instruction Set Computer) — fixed-length instructions, load/store architecture (only load/store touch memory, all computation on registers), simple addressing modes. ARM, RISC-V, MIPS. Easier to pipeline.
Modern reality — the distinction has blurred. x86 processors internally decode CISC instructions into RISC-like micro-ops and execute them on a RISC-style pipeline. Performance comes from microarchitecture, not ISA complexity.

Control Unit

Fetch, decode, sequence

ALU + Registers

Compute + fast storage

Memory

Instructions + Data

System Bus (Address + Data + Control)

Key Components

Control Unit — fetches instructions, decodes them, generates control signals. Hardwired (fast, inflexible) or microcoded (slower, programmable). Modern x86 uses microcode for complex instructions.
Datapath — ALU, register file, multiplexers, buses. The "plumbing" that moves and transforms data. Width (32-bit, 64-bit) determines the native word size.
Register File — small, fast storage inside the CPU. RISC-V has 32 integer registers (x0-x31, x0 hardwired to 0). x86-64 has 16 general-purpose registers (RAX-R15). Architectural registers are mapped to a larger physical register file in OoO processors.
Program Counter (PC) — holds the address of the next instruction. Incremented after fetch, modified by branches and jumps.

🧠 A CPU is like the brain of a computer. It follows a simple loop: FETCH an instruction, DECODE what it means, EXECUTE it, STORE the result. It does this billions of times per second! Your phone's CPU does about 3 billion of these cycles every single second.

Pipeline Stages

Pipelining is the single most important technique in processor design. Like an assembly line in a factory, it overlaps the execution of multiple instructions. While instruction N is being executed, instruction N+1 is being decoded, and instruction N+2 is being fetched. Throughput increases proportionally to the number of stages (ideally).

Fetch

→

Decode

→

Execute

→

MEM

Memory

→

Write Back

Each stage executes in 1 clock cycle; instructions overlap in different stages

Classic 5-Stage RISC Pipeline

IF (Instruction Fetch) — read the instruction from the instruction cache (I-cache) using the program counter (PC). Increment PC to point to the next instruction. Branch prediction happens here.
ID (Instruction Decode) — decode the instruction opcode, read source registers from the register file. Generate control signals. Detect hazards. In RISC-V, decode is simple because all instructions are 32 bits with regular field positions.
EX (Execute) — ALU performs the operation (add, subtract, compare, shift). For loads/stores, the ALU computes the effective memory address (base register + offset). Branch target address is calculated here.
MEM (Memory Access) — loads read from data cache (D-cache); stores write to D-cache. Non-memory instructions pass through this stage without doing anything (pipeline stage still takes one cycle).
WB (Write Back) — write the result back to the destination register in the register file. For loads, the data read from memory is written. For ALU instructions, the computation result is written.

Pipeline Performance

Throughput — in the ideal case, one instruction completes per cycle (CPI = 1), regardless of how many stages. The pipeline doesn't make individual instructions faster; it makes the throughput higher.
Latency — each instruction still takes 5 cycles from start to finish. More stages means higher latency per instruction but higher clock frequency (shorter critical path per stage).
Deep pipelines — Intel Pentium 4 (Prescott) had 31 stages, targeting high clock speeds (3.8 GHz). But deep pipelines amplify branch misprediction penalties and hazard costs. Modern designs use 12-20 stages as a balanced compromise.
CPI (Cycles Per Instruction) — ideal CPI is 1 for a scalar pipeline. Real CPI is higher due to hazards, cache misses, and branch mispredictions. Superscalar processors target CPI < 1 (IPC > 1).

Pipeline Hazards

Hazards are situations that prevent the next instruction from executing in its designated clock cycle. They are the main reason real pipelines don't achieve ideal CPI of 1.

Data Hazards

RAW (Read After Write) — instruction N writes a register that instruction N+1 reads. The most common hazard. Solution: forwarding (bypass) — route the result from the EX/MEM stage directly to the input of the next instruction's EX stage, bypassing the register file write.
Load-use hazard — a special RAW case. A load instruction's result isn't available until end of MEM stage, but the next instruction needs it in EX. Cannot be fully solved by forwarding alone; requires a 1-cycle pipeline stall (bubble). Compilers reorder instructions to fill the "load delay slot."
WAR and WAW — Write After Read and Write After Write. Not problems in simple in-order pipelines (writes happen in order). Become issues in out-of-order processors, solved by register renaming.

Control Hazards

Branch hazards — when a branch is taken, the instructions fetched after the branch (from the fall-through path) are wrong and must be flushed. In a 5-stage pipeline, the branch outcome is known in EX (3rd stage), causing a 2-cycle penalty.
Mitigation — branch prediction (guess the outcome), delayed branches (fill the slot with a useful instruction — used in MIPS, not RISC-V), branch target buffers (cache predicted target addresses).

Structural Hazards

Two instructions need the same hardware resource simultaneously. Example: a single-ported memory accessed by both IF (instruction fetch) and MEM (data access) in the same cycle. Solution: separate instruction and data caches (modified Harvard), or multi-ported register files. Good hardware design eliminates most structural hazards.

Superscalar & Out-of-Order Execution

Superscalar Processors

A superscalar processor can issue multiple instructions per clock cycle. A 4-wide superscalar can fetch, decode, and issue up to 4 instructions per cycle, achieving IPC (Instructions Per Cycle) greater than 1. All modern high-performance CPUs are superscalar: Apple M-series (8-wide decode), AMD Zen 5 (6-wide), Intel Golden Cove (6-wide).

Out-of-Order Execution

Instructions are executed not in program order, but as soon as their operands are ready. This extracts Instruction-Level Parallelism (ILP) that in-order processors miss. Invented by Robert Tomasulo at IBM in 1967 for the System/360 Model 91.

Issue Queue / Reservation Stations — instructions wait here until their source operands are ready. When all operands are available, the instruction is dispatched to an execution unit.
Register Renaming — maps architectural registers to a larger pool of physical registers. Eliminates WAR and WAW hazards by giving each write a unique physical register. Uses a RAT (Register Alias Table) to track the mapping.
Reorder Buffer (ROB) — maintains program order for commits. Instructions execute out-of-order but retire (commit results to architectural state) in order. This ensures precise exceptions: if an interrupt occurs, the processor state reflects exactly the instructions that completed in program order.
Execution Units — modern CPUs have multiple ALUs, FPUs, load/store units, branch units. AMD Zen 4 has 4 ALUs, 2 AGUs (Address Generation Units), 2 FP/SIMD pipes per core.

Limits of ILP

In practice, IPC rarely exceeds 3-4 for general-purpose code. Limitations: true data dependencies (can't be renamed away), limited branch prediction accuracy, cache misses that stall the pipeline, and the instruction window size (how far ahead the processor can look). This "ILP wall" motivated the shift to multi-core processors around 2005.

Branch Prediction

Branches make up 15-25% of all instructions. Without prediction, a branch creates a pipeline bubble equal to the pipeline depth. Modern predictors achieve 95-99% accuracy, critical for deep pipelines where misprediction costs 15-20 cycles.

Static Prediction

Always not-taken — simplest. Assume fall-through. Correct ~40% of the time.
Backward taken, forward not-taken (BTFN) — backward branches are usually loops (taken); forward branches are usually if-statements (often not-taken). Correct ~65% of the time.
Compiler hints — the compiler can mark branches as likely/unlikely. GCC's __builtin_expect(), C++20's [[likely]]/[[unlikely]].

Dynamic Prediction

1-bit predictor — remember last outcome. Problem: mispredicts twice on every loop exit (at the end: taken to not-taken, at the start of next iteration: not-taken to taken).
2-bit saturating counter — four states: strongly taken, weakly taken, weakly not-taken, strongly not-taken. Must mispredict twice to change prediction. Handles loop exits with only one misprediction. First used in Intel i486.
Correlating (two-level) predictor — uses the outcome of recent branches (global history register, GHR) to predict the current branch. Captures patterns like "if (a) ... if (b) ..." where outcomes are correlated.
Tournament predictor — combines a local predictor and a global predictor with a chooser that selects which one to trust based on past accuracy. Used in Alpha 21264, AMD processors.
TAGE (TAgged GEometric) — state-of-the-art predictor. Multiple tables indexed with geometric history lengths (e.g., 8, 16, 32, 64, 128 bits of history). First described by Andre Seznec (2006). Used in AMD Zen, Intel since Haswell.
Perceptron predictor — uses a linear perceptron (neural network with one layer) per branch. Weights are updated on mispredictions. AMD Zen (2017) was the first commercial perceptron predictor.

Branch Target Buffer (BTB)

A cache that stores the target address of recently taken branches. Indexed by the PC of the branch instruction. Without a BTB, even a correct direction prediction is useless because the target address isn't known until decode. The BTB allows fetching from the predicted target in the very next cycle. Modern BTBs hold 4K-16K entries.

Return Address Stack (RAS)

A small hardware stack that predicts return addresses for function calls. On a CALL instruction, the return address is pushed; on a RET, it is popped. Very high accuracy (>99%) because call/return patterns are regular. Typical depth: 16-32 entries. Overflow wraps around.

Cache Hierarchy

Caches exploit locality of reference to bridge the speed gap between the CPU and main memory. Without caches, a modern 5 GHz processor would spend most of its time waiting for DRAM (100ns latency = 500 wasted cycles).

Cache Levels

L1 Cache — smallest (32-64 KB per core), fastest (4-5 cycle latency, ~1ns). Split into L1 instruction cache (L1i) and L1 data cache (L1d). Closest to the CPU pipeline.
L2 Cache — larger (256 KB - 1 MB per core), slower (12-14 cycles, ~3-5ns). Unified (holds both instructions and data). Private per core in most modern designs.
L3 Cache (LLC) — largest (8-96 MB), shared among all cores. 30-50 cycle latency (~10-15ns). Acts as a victim cache and reduces traffic to main memory. Intel uses inclusive L3; AMD Zen uses victim L3.
Main Memory (DRAM) — gigabytes of capacity but ~70-100ns latency (200-300+ CPU cycles). DDR5 bandwidth: 50-60 GB/s per channel.

Cache Organization

Direct-mapped — each memory address maps to exactly one cache line. Simple, fast, but suffers from conflict misses (two addresses mapping to the same line evict each other).
Set-associative — each address maps to a set of N lines (N-way associative). Reduces conflicts. 8-way or 16-way is common for L1. Requires N comparators in parallel.
Fully associative — any line can go anywhere. No conflict misses. Too expensive for large caches (requires comparing every tag). Used for TLBs and small special-purpose caches.
Cache line size — typically 64 bytes. Exploits spatial locality. When a miss occurs, the entire line is fetched from the next level. Too large wastes bandwidth on unused data; too small misses spatial locality opportunities.

Write Policies

Write-back — writes go only to the cache; the cache line is marked dirty. Written to the next level only on eviction. Reduces write traffic. Used in L1/L2.
Write-through — every write goes to both cache and the next level. Simpler coherence but more write traffic. Sometimes used for L1 with a write buffer.
Write-allocate — on a write miss, fetch the line into cache, then write. Paired with write-back.
No-write-allocate — on a write miss, write directly to the next level without fetching into cache. Paired with write-through.

Cache Coherence

In multi-core processors, each core has its own L1/L2 cache. If core A writes to an address cached by core B, B's copy becomes stale. Coherence protocols ensure all cores see a consistent view of memory.

MESI protocol — each cache line is in one of four states: Modified (dirty, exclusive), Exclusive (clean, only copy), Shared (clean, other copies exist), Invalid. Used by Intel. AMD uses MOESI (adds Owned state to reduce write-backs).
Snooping — all caches monitor (snoop) a shared bus for writes by other cores. Simple but doesn't scale beyond ~8 cores due to bus bandwidth.
Directory-based — a centralized directory tracks which caches hold each line. Scales to many cores. Used in NUMA (Non-Uniform Memory Access) systems and server processors.

📚 Imagine your cache is like a desk, and RAM is like a bookshelf across the room. When you're doing homework, you keep the books you need on your desk (fast to grab). If a book isn't there, you walk to the shelf (slow). CPUs do the same thing — they keep frequently used data in tiny, super-fast caches so they don't have to wait for slow main memory!

The Memory Wall

First described by Wm. A. Wulf and Sally A. McKee in 1995. The core problem: CPU speed has improved ~1000x since 1980, but DRAM latency has improved only ~10x. This growing gap means memory access is increasingly the bottleneck, not computation.

Impact

A modern CPU running at 5 GHz can execute 20+ billion operations per second, but a single DRAM access takes ~60-100ns (300-500 cycles of waiting).
On cache-unfriendly workloads (pointer chasing, random access, large working sets), the CPU spends 50-80% of its time waiting for memory.
Adding more cores doesn't help if they all compete for the same memory bandwidth.

Mitigations

Cache hierarchy — L1/L2/L3 caches intercept 90-99% of memory accesses. 3D-stacked SRAM (e.g., AMD 3D V-Cache, adding 64MB of L3) further reduces misses.
Prefetching — hardware detects access patterns (stride, stream) and fetches data before it's needed. Software prefetch instructions (__builtin_prefetch()) provide hints.
Out-of-order execution — the processor continues executing independent instructions while waiting for a cache miss, hiding memory latency.
Simultaneous Multithreading (SMT) — two hardware threads share one core. When one thread stalls on a cache miss, the other thread uses the execution units. Intel calls this Hyper-Threading.
HBM (High Bandwidth Memory) — 3D-stacked DRAM on an interposer, providing 1-2 TB/s bandwidth. Used in GPUs and AI accelerators (NVIDIA H100: 3.35 TB/s).
Processing-in-Memory (PIM) — move computation to where the data lives. Samsung PIM-DRAM and UPMEM's PIM chips embed small processors inside DRAM dies.

Software Implications

Understanding the memory wall is essential for writing fast software. Data structure layout matters enormously: arrays of structs vs structs of arrays, cache-friendly access patterns, avoiding pointer chasing, minimizing TLB misses (huge pages), and aligning data to cache line boundaries. The fastest algorithm on paper can be the slowest in practice if it has poor cache behavior.