🔲 Chip Design Institute
Educational Resources

CPU Architecture

How processors execute instructions. From the von Neumann model through pipelining, superscalar execution, and the memory hierarchy that makes it all practical.

Architecture Models

Von Neumann Architecture

Described by John von Neumann in 1945 in the "First Draft of a Report on the EDVAC" (building on Eckert and Mauchly's ENIAC work). The key insight: store both program instructions and data in the same memory. A single bus connects CPU and memory, so instructions and data compete for the same memory bandwidth. This is the "von Neumann bottleneck." Nearly all modern general-purpose processors are fundamentally von Neumann machines, though with modifications.

Harvard Architecture

Uses separate memories (and buses) for instructions and data. Named after the Harvard Mark I (1944). Eliminates the von Neumann bottleneck by allowing simultaneous instruction fetch and data access. Used in DSPs (digital signal processors) and microcontrollers (PIC, AVR). Most modern CPUs use a "modified Harvard" approach: separate L1 instruction and data caches (Harvard at L1), but unified main memory (von Neumann at DRAM).

RISC vs CISC

[Diagram: Control Unit (fetch, decode, sequence) and ALU + Registers (compute + fast storage), connected to Memory (instructions + data) by the System Bus (address + data + control)]

Key Components

🧠 A CPU is like the brain of a computer. It follows a simple loop: FETCH an instruction, DECODE what it means, EXECUTE it, STORE the result. It does this billions of times per second! Your phone's CPU does about 3 billion of these cycles every single second.
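
The loop above can be sketched as a toy machine. This is only an illustration: the four opcodes, the (opcode, address) encoding, and the single accumulator register are invented for this example and do not correspond to any real instruction set. Note that code and data sit in the same memory list, as in the von Neumann model.

```python
# Toy stored-program machine: instructions and data share one memory list.
# The opcodes and the (opcode, address) encoding are invented for illustration.
LOAD, ADD, STORE, HALT = 0, 1, 2, 3


def run(memory):
    """Execute instructions starting at address 0 until HALT is reached."""
    pc = 0    # program counter
    acc = 0   # single accumulator register
    while True:
        opcode, addr = memory[pc]      # FETCH the next instruction
        pc += 1
        if opcode == LOAD:             # DECODE the opcode, then EXECUTE
            acc = memory[addr]
        elif opcode == ADD:
            acc += memory[addr]
        elif opcode == STORE:          # STORE the result back to memory
            memory[addr] = acc
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode}")


# Addresses 0-3 hold code and addresses 4-6 hold data, all in one memory.
program = [
    (LOAD, 4),
    (ADD, 5),
    (STORE, 6),
    (HALT, 0),
    2,   # address 4
    3,   # address 5
    0,   # address 6: the result goes here
]
result = run(program)
print(result[6])  # 5
```

A real CPU does the same steps in hardware, and overlaps them across instructions rather than finishing one before starting the next.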

Pipeline Stages

Pipelining is the single most important technique in processor design. Like an assembly line in a factory, it overlaps the execution of multiple instructions. While instruction N is being executed, instruction N+1 is being decoded, and instruction N+2 is being fetched. Ideally, throughput scales with the number of stages; in practice, hazards and uneven stage delays keep real pipelines below that limit.

IF (Fetch) → ID (Decode) → EX (Execute) → MEM (Memory) → WB (Write Back)
Each stage executes in 1 clock cycle; instructions overlap in different stages

Classic 5-Stage RISC Pipeline
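
As a rough worked example of the ideal case, the sketch below counts cycles for a pipeline with no stalls. The function names are made up for this example, and the model assumes every stage takes exactly one cycle and no hazards occur.

```python
def pipeline_cycles(n_instructions, n_stages):
    """Cycles to finish n instructions on an ideal pipeline with no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one finishes
    # one cycle after its predecessor.
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions, n_stages):
    """Cycles if each instruction passes through all stages before the next starts."""
    return n_instructions * n_stages


def speedup(n_instructions, n_stages):
    return unpipelined_cycles(n_instructions, n_stages) / pipeline_cycles(
        n_instructions, n_stages
    )


print(pipeline_cycles(3, 5))          # 7
print(round(speedup(1000, 5), 2))     # 4.98
```

With only 3 instructions the pipeline is mostly filling and draining; with 1000 the speedup approaches the stage count of 5.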

Pipeline Performance

Pipeline Hazards

Hazards are situations that prevent the next instruction from executing in its designated clock cycle. They are the main reason real pipelines don't achieve the ideal CPI (cycles per instruction) of 1.

Data Hazards

Control Hazards

Structural Hazards

Two instructions need the same hardware resource simultaneously. Example: a single-ported memory accessed by both IF (instruction fetch) and MEM (data access) in the same cycle. Solutions: separate instruction and data caches (modified Harvard) for the memory conflict, and multi-ported register files for conflicts on register reads and writes. Good hardware design eliminates most structural hazards.

Superscalar & Out-of-Order Execution

Superscalar Processors

A superscalar processor can issue multiple instructions per clock cycle. A 4-wide superscalar can fetch, decode, and issue up to 4 instructions per cycle, achieving IPC (Instructions Per Cycle) greater than 1. All modern high-performance CPUs are superscalar: Apple M-series (8-wide decode), AMD Zen 5 (two 4-wide decode clusters feeding 8-wide dispatch), Intel Golden Cove (6-wide decode).

Out-of-Order Execution

Instructions execute as soon as their operands are ready, rather than strictly in program order. This extracts Instruction-Level Parallelism (ILP) that in-order processors miss. The classic scheme is Tomasulo's algorithm, developed by Robert Tomasulo at IBM and published in 1967 for the floating-point unit of the System/360 Model 91.
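
To see why executing out of program order helps, here is a deliberately simplified timing model. It is not Tomasulo's algorithm: it assumes unlimited functional units, that instruction i enters the window at cycle i, and that register renaming has already removed false dependencies. The `schedule` function and the (dest, sources, latency) tuple format are invented for this sketch.

```python
def schedule(program, in_order):
    """Return the cycle at which the last instruction finishes.

    program is a list of (dest, sources, latency) tuples in program order.
    """
    ready_at = {}       # register -> cycle its newest value becomes available
    prev_issue = -1
    last_finish = 0
    for index, (dest, sources, latency) in enumerate(program):
        operands_ready = max((ready_at.get(s, 0) for s in sources), default=0)
        if in_order:
            # Cannot issue before the previous instruction has issued.
            earliest = prev_issue + 1
        else:
            # Assumption: instruction i is available to issue from cycle i.
            earliest = index
        issue = max(earliest, operands_ready)
        finish = issue + latency
        ready_at[dest] = finish
        prev_issue = issue
        last_finish = max(last_finish, finish)
    return last_finish


program = [
    ("r1", [], 10),       # slow load, e.g. a cache miss
    ("r2", ["r1"], 1),    # needs the load result
    ("r3", [], 1),        # independent work
    ("r4", ["r3"], 1),    # depends only on r3
]
print(schedule(program, in_order=True))    # 13
print(schedule(program, in_order=False))   # 11
```

In the in-order run, the two independent instructions wait behind the stalled add. In the out-of-order run, they complete while the load is still outstanding.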

Limits of ILP

In practice, IPC rarely exceeds 3-4 for general-purpose code. Limitations: true data dependencies (can't be renamed away), limited branch prediction accuracy, cache misses that stall the pipeline, and the instruction window size (how far ahead the processor can look). This "ILP wall" motivated the shift to multi-core processors around 2005.

Branch Prediction

Branches make up 15-25% of all instructions. Without prediction, every branch stalls instruction fetch until the branch resolves, creating a bubble as long as the number of stages between fetch and branch resolution. Modern predictors achieve 95-99% accuracy, which is critical for deep pipelines where a misprediction costs 15-20 cycles.
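
The figures above can be combined into a back-of-the-envelope estimate. The sketch below uses the common first-order model (base CPI plus mispredictions per instruction times the penalty); the function name and the specific inputs of 20% branches and a 20-cycle penalty are picked from the ranges quoted above for illustration.

```python
def effective_cpi(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """First-order CPI estimate that charges only for mispredicted branches."""
    mispredicts_per_instruction = branch_fraction * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instruction * penalty_cycles


for accuracy in (0.90, 0.95, 0.99):
    cpi = effective_cpi(1.0, 0.20, accuracy, 20)
    print(f"accuracy {accuracy:.2f}: CPI {cpi:.2f}")
# accuracy 0.90: CPI 1.40
# accuracy 0.95: CPI 1.20
# accuracy 0.99: CPI 1.04
```

Under these assumptions, moving from 90% to 99% accuracy removes most of the branch overhead, which is why predictor accuracy matters so much on deep pipelines.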

Static Prediction

Dynamic Prediction

Branch Target Buffer (BTB)

A cache that stores the target address of recently taken branches. Indexed by the PC of the branch instruction. Without a BTB, even a correct direction prediction is useless because the target address isn't known until decode. The BTB allows fetching from the predicted target in the very next cycle. Modern BTBs hold 4K-16K entries.
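
A minimal software model of a BTB might look like the following. It is a sketch under several assumptions: the table is direct-mapped, instructions are 4-byte aligned, and the full PC is stored as the tag. Real BTBs are usually set-associative and store partial tags; the class and method names here are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: index by low PC bits, tag with the full PC."""

    def __init__(self, num_entries=4096):
        if num_entries <= 0 or num_entries & (num_entries - 1):
            raise ValueError("num_entries must be a power of two")
        self.mask = num_entries - 1
        self.entries = [None] * num_entries   # each entry is (pc, target)

    def _index(self, pc):
        # Assumes 4-byte aligned instructions, so the low 2 bits carry no information.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        """Return the predicted target for pc, or None on a BTB miss."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Record the target of a taken branch, replacing any conflicting entry."""
        self.entries[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer(4096)
print(btb.predict(0x1000))        # None: branch not seen yet
btb.update(0x1000, 0x2000)
print(hex(btb.predict(0x1000)))   # 0x2000
btb.update(0x5000, 0x6000)        # maps to the same index as 0x1000
print(btb.predict(0x1000))        # None: the earlier entry was evicted
```

The last lookup shows a conflict miss: two branches whose PCs share the same index bits evict each other in a direct-mapped table.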

Return Address Stack (RAS)

A small hardware stack that predicts return addresses for function calls. On a CALL instruction, the return address is pushed; on a RET, it is popped. Very high accuracy (>99%) because call/return patterns are regular. Typical depth: 16-32 entries. On overflow, the stack wraps around and overwrites its oldest entries.
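
The push, pop, and wrap-around behavior can be modeled with a small circular buffer. This is an illustrative sketch, not a description of any particular processor; the class name is hypothetical, and returning None on an empty stack is a convenience of the model.

```python
class ReturnAddressStack:
    """Circular return-address stack; overflow overwrites the oldest entries."""

    def __init__(self, depth=16):
        self.depth = depth
        self.slots = [None] * depth
        self.top = 0   # number of pushes not yet matched by a pop

    def push(self, return_address):
        """Called on CALL."""
        self.slots[self.top % self.depth] = return_address
        self.top += 1

    def pop(self):
        """Called on RET; returns the predicted return address."""
        if self.top == 0:
            return None
        self.top -= 1
        return self.slots[self.top % self.depth]


ras = ReturnAddressStack(depth=2)
for address in (0x10, 0x20, 0x30):   # three nested calls overflow a depth of 2
    ras.push(address)
print(hex(ras.pop()))   # 0x30: correct
print(hex(ras.pop()))   # 0x20: correct
print(hex(ras.pop()))   # 0x30: wrong, 0x10 was overwritten on overflow
```

The third pop shows the cost of overflow: call chains deeper than the stack mispredict their outermost returns.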

Cache Hierarchy

Caches exploit locality of reference to bridge the speed gap between the CPU and main memory. Without caches, a modern 5 GHz processor would spend most of its time waiting for DRAM: 100 ns of latency at 5 GHz is 500 wasted cycles per access.
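
To make the effect of locality concrete, the sketch below simulates a small direct-mapped cache and compares two access patterns. The parameters (512 lines of 64 bytes, 4-byte elements) and the function name are assumptions chosen for the example, not figures for any specific CPU.

```python
def hit_rate(addresses, num_lines=512, line_size=64):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    tags = [None] * num_lines   # block number currently held by each line
    hits = 0
    for address in addresses:
        block = address // line_size
        index = block % num_lines
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block   # miss: fill the line
    return hits / len(addresses)


# 4-byte elements read in order: 16 elements share each 64-byte line.
sequential = [i * 4 for i in range(16384)]
# One element per 4 KiB: every access lands on a different line.
strided = [i * 4096 for i in range(16384)]

print(hit_rate(sequential))   # 0.9375
print(hit_rate(strided))      # 0.0
```

Both patterns perform the same number of reads, but the sequential scan misses only once per line while the strided scan misses every time.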

Cache Levels

Cache Organization

Write Policies

Cache Coherence

In multi-core processors, each core has its own L1/L2 cache. If core A writes to an address cached by core B, B's copy becomes stale. Coherence protocols ensure all cores see a consistent view of memory.
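
The stale-copy problem and one way to avoid it can be sketched as follows. This is a heavily simplified write-invalidate model with write-through to memory, assumed here for brevity; real protocols such as MESI track a state per cache line and avoid writing through on every store. The class name is hypothetical.

```python
class CoherentSystem:
    """Write-invalidate sketch: a write removes every other core's copy."""

    def __init__(self, num_cores):
        self.memory = {}
        self.caches = [dict() for _ in range(num_cores)]

    def read(self, core, address):
        cache = self.caches[core]
        if address not in cache:
            # Miss: fetch the current value from memory.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, core, address, value):
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(address, None)   # invalidate copies in other cores
        self.caches[core][address] = value
        self.memory[address] = value       # write-through, a simplification


system = CoherentSystem(num_cores=2)
system.write(0, 0x40, 1)
print(system.read(1, 0x40))   # 1: core B now caches the line
system.write(0, 0x40, 2)      # core A writes; core B's copy is invalidated
print(system.read(1, 0x40))   # 2: core B misses and refetches the new value
```

Without the invalidation step in `write`, the second read on core B would return the stale value 1 from its own cache.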

📚 Imagine your cache is like a desk, and RAM is like a bookshelf across the room. When you're doing homework, you keep the books you need on your desk (fast to grab). If a book isn't there, you walk to the shelf (slow). CPUs do the same thing: they keep frequently used data in tiny, super-fast caches so they don't have to wait for slow main memory!

The Memory Wall

First described by Wm. A. Wulf and Sally A. McKee in 1995. The core problem: CPU speed has improved ~1000x since 1980, but DRAM latency has improved only ~10x. This growing gap means memory access is increasingly the bottleneck, not computation.

Impact

Mitigations

Software Implications

Understanding the memory wall is essential for writing fast software. Data structure layout matters enormously: arrays of structs vs structs of arrays, cache-friendly access patterns, avoiding pointer chasing, minimizing TLB misses (huge pages), and aligning data to cache line boundaries. The fastest algorithm on paper can be the slowest in practice if it has poor cache behavior.
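
The arrays-of-structs versus structs-of-arrays point can be quantified by counting cache lines. The sketch below assumes a hypothetical 32-byte struct with one 4-byte field of interest at offset 0, and 64-byte cache lines; it counts how many distinct lines a scan of that one field touches under each layout.

```python
def lines_touched(num_elements, stride_bytes, field_bytes=4, line_size=64):
    """Count distinct cache lines read when scanning one field per element."""
    lines = set()
    for i in range(num_elements):
        start = i * stride_bytes
        end = start + field_bytes - 1
        lines.update(range(start // line_size, end // line_size + 1))
    return len(lines)


n = 1024
aos = lines_touched(n, stride_bytes=32)   # field embedded in 32-byte structs
soa = lines_touched(n, stride_bytes=4)    # field packed in its own array
print(aos, soa)   # 512 64
```

Under these assumptions the struct-of-arrays layout pulls in 8 times fewer cache lines for the same scan, because every byte fetched is a byte the loop actually uses.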

Resources

Hennessy & Patterson: Computer Architecture: A Quantitative Approach (6th ed.)

The definitive graduate textbook. ILP, memory hierarchy, multiprocessors, domain-specific architectures. Turing Award winners' masterwork.

Textbook | Advanced

Patterson & Hennessy: Computer Organization and Design (RISC-V)

The undergraduate classic. Pipeline design, cache hierarchy, virtual memory, multiprocessors. Now in RISC-V edition.

Textbook | Intermediate

MIT 6.004: Computation Structures

From digital logic to pipelined processors. Full lecture videos, labs, and problem sets. Outstanding free course.

MIT OCW | Free

Dan Luu: Branch Prediction

Practical deep-dive into branch prediction. History, mechanisms, performance implications. Excellent technical blog post.

Blog | Free

uops.info

Detailed measurements of instruction latency, throughput, and port usage for Intel and AMD processors. Invaluable for micro-optimization.

Database | Free

Chips and Cheese

In-depth microarchitecture analysis. Cache measurements, branch predictor reverse-engineering, die shots. Community-driven hardware analysis.

Blog | Free