FPGA Design

Field-Programmable Gate Arrays let you build custom digital hardware without a billion-dollar fab. From blinking LEDs to AI accelerators, FPGAs are the fastest path from idea to working silicon.

What is an FPGA?

A Field-Programmable Gate Array is an integrated circuit that can be configured by the user after manufacturing. Unlike an ASIC (Application-Specific Integrated Circuit), which is hardwired at the factory, an FPGA can be reprogrammed to implement any digital circuit that fits within its resources. The first commercial FPGA was the Xilinx XC2064, released in 1985 with just 64 Configurable Logic Blocks.

Why FPGAs Matter

Reconfigurability — change the design without manufacturing a new chip. Fix bugs, add features, adapt to new standards. A single FPGA board can be a CPU today, a video encoder tomorrow.
Time-to-market — no 6-12 month ASIC fabrication cycle. Synthesize, place, route, program — in minutes to hours. Iterate rapidly.
Parallelism — unlike a CPU that executes instructions sequentially, an FPGA runs all configured logic simultaneously. A 1000-stage pipeline? Done. 256 parallel multipliers? Done. Custom hardware for your specific algorithm.
Low volume economics — ASIC NRE (Non-Recurring Engineering: mask set, design, and verification) runs $10M-$100M+ at advanced nodes. FPGAs cost $5 to $50,000 per device but carry no mask NRE. Breakeven is typically 10K-100K units.
Prototyping — validate ASIC designs on FPGAs before tape-out. Running at lower clock speeds but with real hardware behavior. Catches bugs that simulation misses.

Where FPGAs Are Used

Telecommunications — 5G base stations, network switches, protocol processing. Huawei, Ericsson, Nokia all use FPGAs extensively.
Data centers — Microsoft Project Catapult (network acceleration), Amazon F1 instances (custom FPGA accelerators), SmartNICs and DPUs.
Defense & aerospace — radar processing, electronic warfare, satellite communications. Radiation-tolerant FPGAs (Microchip RTG4) for space.
Automotive — ADAS sensor fusion, infotainment, in-vehicle networking. Xilinx/AMD Zynq UltraScale+ in autonomous driving.
High-frequency trading — sub-microsecond latency for order execution. FPGAs process market data and generate orders in hardware, bypassing OS and software stacks.
AI inference — custom quantized neural network accelerators. Microsoft uses FPGAs for Bing search ranking and Azure AI.

🎨 An FPGA is like a LEGO board for circuits. Instead of building one fixed thing (like a CPU chip), you can rearrange the pieces to build ANYTHING — a video processor today, a robot brain tomorrow! You can reprogram it over and over. It's like having a chip that can shapeshift into whatever hardware you need.

FPGA Architecture

An FPGA consists of an array of configurable logic blocks (CLBs) connected by a programmable routing network, surrounded by I/O blocks (IOBs) and embedded hard IP blocks.

CLBs: LUTs + flip-flops (logic fabric)
BRAM: 18/36 Kbit blocks (on-chip SRAM)
DSP Slices: 27x18 multiply-accumulate (math acceleration)
I/O Blocks: multi-standard (LVDS, LVCMOS)
Transceivers: 32-58 Gbps (PCIe, Ethernet)
Hard IP: ARM cores, DDR controllers (fixed silicon)

Major Components

Configurable Logic Blocks (CLBs) — the computing fabric. Each CLB contains lookup tables (LUTs), flip-flops, carry chains, and multiplexers. The LUTs implement arbitrary Boolean functions; the flip-flops store state.
I/O Blocks (IOBs) — interface between internal logic and external pins. Support multiple I/O standards (LVCMOS, LVDS, SSTL, HSTL) and voltage levels. Configurable as input, output, or bidirectional with programmable drive strength and slew rate.
Block RAM (BRAM) — embedded SRAM blocks, typically 18 or 36 Kbit each. Configurable as single-port, dual-port, or simple-dual-port RAM/ROM. Used for buffers, FIFOs, register files. A large device such as the Xilinx Virtex UltraScale+ VU9P has 2,160 36 Kbit blocks.
DSP Slices — hardened multiply-accumulate units. Each DSP slice has a 27x18-bit multiplier, 48-bit accumulator, pre-adder, and pattern detector. Used for signal processing, neural networks, and math-heavy applications. A large FPGA has 2,000-12,000 DSP slices.
Clock Management — PLLs (Phase-Locked Loops) and MMCMs (Mixed-Mode Clock Managers) generate and manipulate clock signals. Clock distribution networks (global, regional, local) minimize skew across the die.
Transceivers — high-speed serial I/O for PCIe, Ethernet, HDMI, USB, SATA. Modern FPGAs support 32-58 Gbps per lane (GTY/GTM on AMD Versal). Used for multi-gigabit communication links.
Hard IP — dedicated silicon for common functions: PCIe controllers, DDR memory controllers, Ethernet MACs, ARM processor cores (Zynq), AI engines (Versal). Faster and more power-efficient than implementing these in fabric.
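
As a rough behavioural illustration of the multiply-accumulate function described for DSP slices above, the following Python sketch models a signed 27x18 multiply feeding a 48-bit accumulator. The function names and the wrap-on-overflow behaviour are assumptions made for illustration; this is not a vendor model of any particular DSP primitive.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def dsp_mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step: acc + a*b, wrapped to 48 bits.

    Operand widths (27 and 18 signed bits) follow the 27x18 figure in the
    text; wrapping on overflow is an assumption of this sketch.
    """
    if not -(1 << 26) <= a < (1 << 26):
        raise ValueError("a does not fit in 27 signed bits")
    if not -(1 << 17) <= b < (1 << 17):
        raise ValueError("b does not fit in 18 signed bits")
    return to_signed(acc + a * b, 48)


def dot_product(xs, ws):
    """Accumulate a dot product one MAC per element, as a single slice would."""
    acc = 0
    for x, w in zip(xs, ws):
        acc = dsp_mac(acc, x, w)
    return acc
```

In hardware, many such slices run side by side, so a dot product of N elements can be spread over N multipliers instead of N sequential steps.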

LUTs & CLBs

The Lookup Table (LUT) is the fundamental logic element of an FPGA. It can implement any Boolean function of its inputs by storing the truth table in SRAM cells.

HDL Design → Synthesis → Place & Route → Timing Analysis → Bitstream

How a LUT Works

A k-input LUT stores 2^k SRAM bits. The input signals serve as the address; the stored bit at that address is the output.
A 6-input LUT (6-LUT, standard in modern FPGAs) stores 64 bits and can implement any function of 6 variables. It's essentially a 64x1 ROM whose contents are loaded during configuration.
Some LUTs can be split: on Xilinx devices a 6-LUT has two outputs and can implement two 5-input functions, provided they share the same five inputs. In certain slices a LUT can also serve as a small distributed RAM or shift register.
Functions larger than 6 inputs are decomposed across multiple LUTs, connected by the routing network.
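
The address-and-lookup behaviour described above can be sketched in a few lines of Python. This is a behavioural model only; the class name `LUT` and the `from_function` helper are hypothetical and do not correspond to any vendor primitive or tool API.

```python
class LUT:
    """Behavioural model of a k-input lookup table."""

    def __init__(self, k: int, init: int):
        self.k = k
        # Keep exactly 2**k configuration bits (the stored truth table).
        self.init = init & ((1 << (1 << k)) - 1)

    @classmethod
    def from_function(cls, k, func):
        """Build the truth table by evaluating func on every input combination."""
        init = 0
        for addr in range(1 << k):
            bits = [(addr >> i) & 1 for i in range(k)]  # bits[0] is input 0
            if func(*bits):
                init |= 1 << addr
        return cls(k, init)

    def __call__(self, *inputs):
        """The inputs form an address; the stored bit at that address is the output."""
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs, got {len(inputs)}")
        addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return (self.init >> addr) & 1


# The same structure, loaded with two different truth tables:
xor6 = LUT.from_function(6, lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
and6 = LUT.from_function(6, lambda a, b, c, d, e, f: a & b & c & d & e & f)
```

Changing the function changes only the 64 stored bits, which is the sense in which configuration "reprograms" the logic.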

CLB Structure (Xilinx 7-Series Example)

Each CLB contains 2 slices. Each slice contains 4 LUT6s, 8 flip-flops, carry chain logic, and multiplexers.
Some slices (SLICEM) can configure LUTs as 64-bit distributed RAM or 32-bit shift registers, in addition to normal logic.
The carry chain provides fast ripple-carry for arithmetic (addition, subtraction, comparison) without using the general routing network. Critical for performance of adders and counters.
Each slice has 8 storage elements with optional clock enable and synchronous/asynchronous reset. Four of them can be configured as either edge-triggered D flip-flops or level-sensitive latches; the other four work only as edge-triggered D flip-flops.
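
To make the carry chain less abstract, here is a small Python sketch of a ripple-carry adder built the way a slice typically builds one: a LUT produces a per-bit propagate signal, a dedicated multiplexer forwards or regenerates the carry, and a dedicated XOR forms the sum bit. The function name and default width are assumptions for illustration, and the model ignores timing entirely.

```python
def carry_chain_add(a: int, b: int, width: int = 8, carry_in: int = 0):
    """Add two unsigned width-bit numbers one bit at a time.

    Per bit: propagate p = a_i XOR b_i (the LUT's job), the carry mux
    passes carry_in along when p is 1 and otherwise takes a_i, and the
    sum bit is p XOR carry_in.
    """
    carry = carry_in & 1
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p = a_i ^ b_i                # computed in the LUT
        total |= (p ^ carry) << i    # sum bit from the dedicated XOR
        carry = carry if p else a_i  # dedicated carry multiplexer
    return total, carry              # (sum, carry out)
```

Only the propagate signal goes through a LUT; the carry itself stays on the dedicated chain, which is why wide adders do not pay general routing delay per bit.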

Intel/Altera: ALMs

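Intel FPGAs use Adaptive Logic Modules (ALMs) as their basic logic element, grouped into Logic Array Blocks (LABs). An ALM has an 8-input fracturable LUT that can implement one function of up to 6 inputs (and some 7-input functions) or be split into two smaller functions, for example two independent 4-input functions or two larger functions that share inputs. It also contains registers (4 in recent families), dedicated adder logic, and register packing. Functionally similar to Xilinx CLBs but with different granularity and flexibility trade-offs.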

🔢 A LUT (lookup table) is the brain cell of an FPGA. It's basically a tiny cheat sheet: for every possible combination of inputs, it stores the answer. A 6-input LUT stores 64 answers. By loading different cheat sheets into thousands of LUTs, you can make the FPGA do anything — it's like reprogramming the hardware itself!

Routing

The programmable routing network connects CLBs, BRAMs, DSPs, and I/O blocks. Routing typically consumes 50-80% of FPGA area and is often the bottleneck for timing closure.

Routing Architecture

Switch boxes — at the intersection of horizontal and vertical routing channels. Contain programmable pass transistors or multiplexers that connect wire segments.
Wire segments — different lengths: single (span 1 CLB), double (2 CLBs), quad (4 CLBs), long (span the entire chip). Short wires are numerous but slow for distant connections; long wires are fast but scarce.
Connection boxes — connect CLB pins to the routing channels. Not all pins connect to all wires; the connection pattern is carefully designed to balance flexibility and area.
Routing delay — wire delay dominates in large FPGAs (not gate delay). A signal crossing the die may pass through 10-20 switch boxes, each adding ~200-500ps. This is why FPGA clock speeds (200-800 MHz) are lower than ASICs (2-5 GHz).
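
The routing-delay figures above translate into clock limits with simple arithmetic. The sketch below uses only the ranges quoted in the text (10-20 switch boxes, 200-500 ps each) and deliberately ignores logic delay, setup time, and clock skew, so the resulting frequencies are upper bounds for a routing-only path; the function names are illustrative.

```python
def routing_delay_ns(switch_boxes: int, delay_per_box_ps: float) -> float:
    """Total delay of a path that only crosses switch boxes."""
    return switch_boxes * delay_per_box_ps / 1000.0


def max_clock_mhz(path_delay_ns: float) -> float:
    """Highest clock whose period still covers the given path delay."""
    return 1000.0 / path_delay_ns


best = routing_delay_ns(10, 200)   # 2.0 ns
worst = routing_delay_ns(20, 500)  # 10.0 ns
```

With these inputs a cross-die route alone caps the clock somewhere between 100 MHz and 500 MHz, before any logic is added to the path.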

Timing Closure

Meeting timing constraints means every path from one flip-flop to another (through combinational logic and routing) completes within one clock period. When the longest (critical) path is too slow, the designer has several options: reduce logic depth by adding pipeline stages, guide placement with floorplanning constraints, use dedicated resources (carry chains, DSPs), or lower the clock frequency. Timing closure is often the hardest part of FPGA design.
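
A simplified setup-slack calculation shows what the timing check amounts to and why pipelining helps. This is a sketch under stated assumptions: delays are in picoseconds, the clock-to-Q and setup values are placeholders rather than datasheet figures, clock skew and hold checks are ignored, and `split_evenly` is an idealised stand-in for real pipelining.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    logic_ps: int    # combinational delay (LUTs, carry chains, ...)
    routing_ps: int  # interconnect delay


def setup_slack_ps(path, period_ps, clk_to_q_ps=500, setup_ps=100):
    """Required time minus arrival time; negative means timing is not met."""
    arrival = clk_to_q_ps + path.logic_ps + path.routing_ps
    required = period_ps - setup_ps
    return required - arrival


def worst_slack_ps(paths, period_ps):
    """Slack of the critical path, i.e. the minimum over all paths."""
    return min(setup_slack_ps(p, period_ps) for p in paths)


def split_evenly(path, stages):
    """Idealised pipelining: cut logic and routing delay into equal stages."""
    return [
        Path(f"{path.name}[{i}]", path.logic_ps // stages, path.routing_ps // stages)
        for i in range(stages)
    ]


adder = Path("adder", logic_ps=2000, routing_ps=1500)
ctrl = Path("ctrl", logic_ps=3500, routing_ps=2500)
```

At a 5000 ps period (200 MHz), `ctrl` fails by 1600 ps while `adder` passes; splitting `ctrl` into two stages brings every stage back to positive slack at the cost of one extra cycle of latency.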

Design Flow

The FPGA design flow transforms HDL code into a bitstream that configures the FPGA.

Steps

1. Design Entry — write Verilog/VHDL/SystemVerilog or use block design (IP integrator). Define the top-level module with I/O connected to FPGA pins.
2. Simulation — verify functionality with testbenches before synthesis. Fix bugs here — it's orders of magnitude easier than debugging on hardware.
3. Synthesis — convert RTL to a netlist of FPGA primitives (LUTs, FFs, BRAMs, DSPs). The synthesizer optimizes for area, speed, or power based on constraints.
4. Implementation — (a) Translate/map netlist to specific FPGA resources. (b) Place each element on the FPGA die. (c) Route connections between placed elements using the programmable routing network.
5. Timing Analysis — static timing analysis (STA) checks all paths against clock constraints. Reports setup/hold slack. Negative slack means timing is not met — go back to step 1 or 4.
6. Bitstream Generation — produce the binary file that configures the FPGA. For Xilinx: .bit file. For Intel: .sof or .pof file.
7. Programming — load the bitstream via JTAG, SPI flash, or other interface. SRAM-based FPGAs (most common) lose their configuration when power is removed, so the bitstream must be reloaded on every power-up, typically from external flash. Flash-based FPGAs (Microchip) retain configuration without external storage.

Tools: Vivado & Quartus

AMD/Xilinx Vivado

Design suite for Xilinx 7-series, UltraScale, UltraScale+, and Versal FPGAs. The free edition (originally WebPACK, later renamed Standard) covers smaller devices (Artix-7, Zynq-7000, Spartan-7).
IP Integrator: graphical block design for connecting IP cores (AXI interconnect, DMA, processor systems). Essential for Zynq SoC designs.
Vivado HLS (now Vitis HLS): compile C/C++ to RTL for FPGA acceleration. Write algorithms in C, generate Verilog — with caveats about quality of results.
ILA (Integrated Logic Analyzer): embed a logic analyzer in the FPGA to capture signals in real-time. Invaluable for on-chip debugging.

Intel Quartus Prime

Design suite for Intel/Altera FPGAs (Cyclone, Arria, Stratix, Agilex). Free Lite edition covers Cyclone and MAX devices.
Platform Designer (formerly Qsys): system integration tool for Nios II soft processors and Avalon interconnect.
Signal Tap: equivalent of Vivado ILA — embedded logic analyzer for real-time signal capture.
Intel oneAPI: FPGA acceleration using SYCL/DPC++ — higher-level than HLS, targeting data center accelerator cards.

Open-Source Toolchain

Yosys — open-source synthesis for Verilog. Supports Lattice iCE40 and ECP5 (fully open), partial support for Xilinx 7-series (via Symbiflow/F4PGA).
nextpnr — open-source place-and-route. Targets iCE40 (complete), ECP5 (complete), Gowin (in progress). Fast, deterministic, and hackable.
Project IceStorm — reverse-engineered bitstream format for Lattice iCE40. Enabled the first fully open-source FPGA toolchain. Pioneering work led by Claire Wolf (creator of Yosys).
F4PGA (formerly Symbiflow) — FOSS FPGA framework targeting Xilinx 7-series and QuickLogic. Uses Yosys for synthesis and VPR or nextpnr for place-and-route.
Amaranth — Python HDL that generates Verilog and integrates with the open-source toolchain. Clean API, great for rapid prototyping on iCE40/ECP5 boards.
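
For an iCE40 target, the open-source tools above chain together roughly as in the following Python sketch, which only builds the command lines and does not run anything. The file names (`<top>.v`, `<top>.pcf`, and so on) and the default device and package are assumptions chosen for illustration; adjust them to the board in use.

```python
def ice40_build_commands(top: str, device: str = "hx8k", package: str = "ct256"):
    """Return one command line per stage: synthesis, place-and-route,
    bitstream packing, and programming. Nothing is executed here."""
    return [
        # Yosys: Verilog in, JSON netlist of iCE40 primitives out.
        ["yosys", "-p", f"synth_ice40 -top {top} -json {top}.json", f"{top}.v"],
        # nextpnr: place and route using the pin constraints in the .pcf file.
        ["nextpnr-ice40", f"--{device}", "--package", package,
         "--json", f"{top}.json", "--pcf", f"{top}.pcf", "--asc", f"{top}.asc"],
        # IceStorm: convert the textual bitstream to binary, then program.
        ["icepack", f"{top}.asc", f"{top}.bin"],
        ["iceprog", f"{top}.bin"],
    ]
```

Each list can be handed to `subprocess.run(cmd, check=True)` in order, so that a failing stage stops the build.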

Common FPGA Projects

Beginner

LED blinker — the "Hello, World" of FPGAs. Counter-driven LED toggle. Teaches clocking, pin constraints, and the build flow.
Seven-segment display driver — BCD to 7-segment decoder. Multiplexed display for multi-digit output. Combinational logic practice.
UART transmitter/receiver — serial communication at 9600-115200 baud. Shift registers, baud rate generators, FSMs. First "useful" peripheral.
Debouncer — filter mechanical switch bounce. Counter-based debounce circuit. Seemingly simple, surprisingly instructive about metastability and clock domains.
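
For the UART project, the baud rate generator reduces to picking an integer clock divisor. The sketch below assumes a 100 MHz system clock, which is a common board frequency but not something the text specifies; the function names are illustrative.

```python
def baud_divisor(clock_hz: int, baud: int) -> int:
    """Clock cycles per UART bit, rounded to the nearest integer."""
    return (clock_hz + baud // 2) // baud


def baud_error_percent(clock_hz: int, baud: int) -> float:
    """Relative error of the achieved baud rate caused by integer rounding."""
    actual = clock_hz / baud_divisor(clock_hz, baud)
    return 100.0 * (actual - baud) / baud


def counter_bits(divisor: int) -> int:
    """Width of a counter that counts from 0 to divisor - 1."""
    return max(1, (divisor - 1).bit_length())


CLOCK_HZ = 100_000_000                     # assumed board clock
div_fast = baud_divisor(CLOCK_HZ, 115200)  # 868
div_slow = baud_divisor(CLOCK_HZ, 9600)    # 10417
```

At 115200 baud the rounding error is well under 0.01%, far inside what a UART receiver tolerates, and the divider fits in a 10-bit counter.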

Intermediate

VGA/HDMI controller — generate video signals from the FPGA. H-sync, V-sync, pixel clock, frame buffer. Outputs to a monitor. Great for understanding timing-critical design.
SPI/I2C controller — master or slave for common serial protocols. Interface with sensors, displays, memory chips. Practical and widely applicable.
FIFO (First-In, First-Out) — asynchronous FIFO for clock domain crossing. Dual-port RAM, Gray code pointers, full/empty flags. A fundamental building block.
PWM controller — pulse-width modulation for motor control, LED dimming, audio. Counter-based with configurable duty cycle and frequency.
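
The Gray code pointers mentioned for the asynchronous FIFO rely on one property: consecutive values differ in exactly one bit, so a pointer sampled in the other clock domain is off by at most one position. A minimal Python sketch of the two conversions follows (function names are illustrative).

```python
def bin_to_gray(n: int) -> int:
    """Binary to reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Gray code back to binary by XOR-ing together all right shifts."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")
```

The single-bit property also holds across the wrap-around from the last value to zero, provided the FIFO depth is a power of two.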

Advanced

RISC-V soft CPU — implement a RISC-V processor on the FPGA. PicoRV32 (small), VexRiscv (configurable), NEORV32 (full-featured). Run C code on your own CPU.
DDR memory controller — interface with external DDR3/DDR4 SDRAM. Extremely timing-sensitive. Usually use vendor IP (MIG for Xilinx) but building one teaches deep hardware design.
Ethernet MAC — 10/100/1000 Mbps Ethernet. MII/GMII interface, CRC calculation, frame parsing. A foundation for networked applications.
Neural network accelerator — custom datapath for CNN/DNN inference. Fixed-point or INT8 matrix multiply using DSP slices. Real-world AI-at-the-edge.
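
The CRC calculation in an Ethernet MAC is a bit-serial shift-and-XOR loop, which is why it maps so naturally to hardware. The Python sketch below implements the standard CRC-32 one bit at a time and cross-checks it against `zlib.crc32`; it models the arithmetic only and makes no attempt to reproduce the on-the-wire bit ordering of the frame check sequence.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """CRC-32 (reflected polynomial 0xEDB88320), processed one bit per step."""
    crc = 0xFFFFFFFF                        # initial register value
    for byte in data:
        crc ^= byte
        for _ in range(8):                  # one shift per input bit
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF                 # final inversion


# Cross-check against the library implementation.
sample = b"123456789"
matches_zlib = crc32_bitwise(sample) == zlib.crc32(sample)
```

A hardware implementation typically unrolls the inner loop so that 8 or more bits are processed per clock cycle.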

FPGA vs ASIC

Comparison

Performance — ASICs typically clock 5-10x faster and, for equivalent functionality, draw 10-50x less power. FPGA routing overhead (switch boxes, configurable interconnect) adds delay and capacitance that custom ASIC metal layers avoid.
Cost per unit — FPGA: $5-$50,000 per chip (depending on size). ASIC: $0.10-$50 per chip at volume, but $10M-$100M+ NRE for mask set, design, and verification.
Time to market — FPGA: days to weeks. ASIC: 12-24 months from RTL-freeze to working silicon.
Flexibility — FPGA: reprogram anytime. ASIC: fixed forever (or until the next tape-out, costing millions).
Volume crossover — at low volume (<10K units), FPGA is cheaper total. At high volume (>100K units), ASIC wins on per-unit cost despite high NRE. The exact crossover depends on the FPGA/ASIC size and process node.
Power — for equivalent functionality, an ASIC uses 10-50x less power. FPGA configuration SRAM and routing overhead consume substantial static power. This matters in battery-powered and data center applications.
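
The volume crossover can be estimated by dividing the ASIC NRE by the per-unit saving. The sketch below uses illustrative prices picked from within the ranges quoted above; they are assumptions, not data for any real device, and the function name is hypothetical.

```python
import math


def break_even_units(asic_nre: float, asic_unit_cost: float, fpga_unit_cost: float):
    """Smallest volume at which total ASIC cost is no higher than total FPGA cost.

    Returns None when the ASIC is not cheaper per unit, in which case
    it never recovers its NRE.
    """
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None
    return math.ceil(asic_nre / saving_per_unit)


# Assumed figures: $10M NRE, $10 ASIC, $210 FPGA.
units = break_even_units(10_000_000, 10, 210)  # 50000
```

With these inputs the crossover lands at 50,000 units, inside the 10K-100K range given earlier; a cheaper FPGA or a higher NRE pushes it further out.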

When to Choose FPGA

Low to medium volume production (<10K-50K units)
Requirements may change after deployment (field-upgradable)
Prototyping before ASIC tape-out
Time-critical deployment (can't wait for ASIC fab)
Applications needing hardware acceleration but not extreme power/performance

When to Choose ASIC

High volume (>100K units) — unit cost dominates
Extreme performance or power requirements (mobile phones, data center)
Stable, well-understood design (low risk of respins)
Competitive market where performance-per-watt matters