FPGA Design

An FPGA is a chip made of lookup tables that can be programmed to act as any digital circuit. That single sentence is technically accurate and completely fails to convey what's interesting about FPGAs: that they let you build custom hardware without manufacturing a chip, that "everything runs at the same time" in a way that breaks software intuitions, and that when you push past the intended operating conditions, digital circuits start behaving like analog ones in genuinely useful ways.

Thinking in Parallel

The hardest conceptual shift for software engineers coming to FPGA design isn't the syntax of Verilog — it's that there's no instruction pointer. In software, one thing happens after another. In hardware, everything described in your design is physically present on the chip and running simultaneously. An assign statement doesn't execute at a particular time; it describes a permanent wire. An always @(posedge CLK) block doesn't run once per clock cycle like a loop iteration; it describes a physical register that latches its input on every rising edge of the clock.¹

This means you don't "call a function" — you instantiate a module. If you need to add two numbers in three different places, you get three physical adders. If you need a pipeline, you literally build a pipeline: a chain of registers, each feeding the next, with data flowing through all of them simultaneously. A 70-stage rendering pipeline processes 70 pixels at once, one in each stage, producing a new pixel every clock cycle. This is how a small FPGA can ray-trace a 3D scene at 60 FPS without a frame buffer — each pixel is computed just in time as the video signal demands it, "racing the beam" exactly the way old CRT hardware did.²
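To make the "all stages at once" picture concrete, here is a minimal software sketch of a register pipeline. It is a Python model rather than HDL, and the name `clock_pipeline` and the three toy stages are assumptions made for illustration. The detail worth noticing is that every register's next value is computed from the current values before any register changes, which is what a shared clock edge does.

```python
def clock_pipeline(stage_fns, inputs):
    """Model a chain of registers that all update on the same clock edge."""
    regs = [None] * len(stage_fns)      # None marks an empty stage (a bubble)
    outputs = []
    # Feed the inputs, then keep clocking bubbles until the pipeline drains.
    for x in list(inputs) + [None] * len(stage_fns):
        # Each stage reads the CURRENT value of the register before it...
        sources = [x] + regs[:-1]
        nxt = [None if s is None else f(s) for f, s in zip(stage_fns, sources)]
        # ...and only then do all registers change together.
        regs = nxt
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(clock_pipeline(stages, [1, 2, 3]))   # [1, 3, 5]
```

After the initial three-cycle latency, one result leaves the pipeline on every clock, even though each individual value still takes three cycles to travel through.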

State machines are how you execute algorithms in hardware. You define states, transitions, and what happens in each state, and the logic advances one state per clock cycle. Where a CPU has a program counter indexing into a stream of instructions, an FPGA has a state register indexing into a set of explicitly designed states. It's more like writing a finite automaton than writing a program, and it's why FPGA development feels closer to language design than to application programming — you're designing the machine, not programming it.
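As a rough illustration of "a state register instead of a program counter", the sketch below models a shift-and-add multiplier as a three-state machine. It is a Python stand-in for what would normally be written in an HDL; the state names, the `step` function, and the register layout are all hypothetical choices, not taken from any particular design.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    BUSY = 1
    DONE = 2


def step(state, regs, start, a, b):
    """One clock edge: compute (next_state, next_regs) from current values."""
    acc, mcand, mplier = regs
    if state is State.IDLE:
        if start:
            return State.BUSY, (0, a, b)      # load operands
        return State.IDLE, regs
    if state is State.BUSY:
        if mplier == 0:
            return State.DONE, regs           # nothing left to add
        if mplier & 1:
            acc += mcand                      # add when the low bit is set
        return State.BUSY, (acc, mcand << 1, mplier >> 1)
    return State.IDLE, regs                   # DONE: go back to IDLE


def multiply(a, b):
    """Drive the machine until DONE; return (product, clock cycles used)."""
    state, regs = step(State.IDLE, (0, 0, 0), True, a, b)
    cycles = 1
    while state is not State.DONE:
        state, regs = step(state, regs, False, 0, 0)
        cycles += 1
    return regs[0], cycles


print(multiply(6, 7))   # (42, 5)
```

The algorithm takes several clock cycles because the machine advances exactly one state per edge; trading cycles for area like this is the usual reason to choose a state machine over a fully combinational multiplier.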

Formal Verification: Proving Instead of Testing

One of the most compelling developments in FPGA design is the accessibility of formal verification through open-source tools. SymbiYosys, built on top of Yosys and open-source SAT and SMT solvers, can exhaustively check properties of your design for all possible inputs over a bounded number of clock cycles, and with induction it can extend that to an unbounded proof. This isn't fuzzing. It isn't high-coverage testing. It's mathematical proof.

The riscv-formal verification suite makes this concrete: for each RISC-V instruction, there's a formal check that the CPU core executes it correctly. Not "correctly for the test cases we thought of" but correctly for all possible machine states, all possible instruction sequences, and all possible memory contents within the checked depth. One hobbyist designed a RISC-V core using SpinalHDL, verified it entirely through riscv-formal without ever running a simulation, and the first time he loaded it onto an FPGA board, it worked. The LED blinked. A CPU, bug-free on its first run in real hardware, designed by one person in a few weeks of evenings.³

The workflow has a distinctive rhythm. You write some hardware, run the formal tool, and get a failing trace in seconds rather than hours. You examine the VCD waveform for a handful of cycles, fix the bug, and run again. The catch is that each fix can produce an entirely different failure: unlike simulation, where you incrementally peel the onion, formal reshuffles the deck after every change. But the counterexamples are always minimal, the shortest sequence of inputs that triggers the violation, which is also why they turn up so quickly.
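The idea of a bounded check that returns the shortest failing trace can be sketched in a few lines. The example below is a brute-force stand-in: it enumerates every input sequence up to a given depth, whereas a real formal tool encodes the same question for a SAT or SMT solver. The design under test (a counter with a deliberate bug), the property, and all function names are hypothetical.

```python
from itertools import product


def next_state(count, inc, reset):
    """Hypothetical design: a counter that is supposed to stay <= 5.
    Deliberate bug: it wraps at 8 instead of saturating at 5."""
    if reset:
        return 0
    if inc:
        return (count + 1) % 8
    return count


def prop(count):
    """The property we want to hold in every reachable state."""
    return count <= 5


def bounded_check(depth):
    """Return the shortest input trace that violates prop, or None."""
    all_inputs = [(0, 0), (1, 0), (0, 1), (1, 1)]    # (inc, reset) pairs
    for k in range(1, depth + 1):                     # shorter traces first
        for trace in product(all_inputs, repeat=k):
            count = 0                                 # reset state
            for inc, reset in trace:
                count = next_state(count, inc, reset)
                if not prop(count):
                    return trace
    return None


print(bounded_check(5))   # None: no violation within 5 cycles
print(bounded_check(6))   # six consecutive (1, 0) inputs
```

Because shorter traces are tried first, the counterexample is minimal by construction. The enumeration also hints at why runtime explodes: the number of traces grows exponentially with depth, which is the cost solvers try to avoid paying in full.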

The limitation is runtime. Simple instruction checks finish in seconds. Pipeline liveness checks might take minutes. And some checks, such as forward progress proofs for complex pipelines, can take months or never terminate. One designer had a pc_fwd check running for 1,301 hours (about 54 days) before giving up. There's no progress bar and no estimate: the solver either finds a proof or it doesn't, and you can't predict which.³

Where Digital Becomes Analog

The most fascinating frontier in FPGA design is the boundary where digital circuits start exhibiting analog behavior. An FPGA is, at bottom, a collection of analog circuits that normally operate within bounds where they behave as ideal digital logic. Push past those bounds, and the analog properties become features rather than bugs.⁴

Ring oscillators are the simplest example: chain an odd number of inverters in a loop and the output oscillates at a frequency determined by the propagation delay through the chain. That frequency varies with temperature, voltage, and manufacturing process — which makes a ring oscillator a thermometer, a voltage sensor, or a silicon fingerprint. Bend the FPGA physically and the frequency shifts, possibly through the piezoresistive effect, making it a strain gauge.⁴
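The relationship between gate delay and oscillation frequency is simple enough to write down. A signal must travel around the loop twice to complete one period, so an ideal ring of N inverters with per-gate delay t oscillates at 1 / (2·N·t). The sketch below encodes that relation and its inverse, which is the basis for using the oscillator as a sensor; the function names and example numbers are assumptions, and real delays are neither uniform nor constant.

```python
def ring_oscillator_hz(n_inverters, gate_delay_s):
    """Ideal frequency of a ring of identical inverters."""
    if n_inverters % 2 == 0:
        # An even ring has a stable state, so it latches instead of oscillating.
        raise ValueError("need an odd number of inverters")
    return 1.0 / (2 * n_inverters * gate_delay_s)


def gate_delay_from_count(edge_count, window_s, n_inverters):
    """Invert the relation: estimate per-gate delay from the number of
    rising edges counted during a fixed measurement window."""
    freq = edge_count / window_s
    return 1.0 / (2 * n_inverters * freq)


f = ring_oscillator_hz(5, 1e-9)                  # about 100 MHz
d = gate_delay_from_count(100_000, 1e-3, 5)      # about 1 ns
```

Anything that changes the gate delay (temperature, supply voltage, mechanical strain) shows up as a change in the edge count, so a counter and a reference clock are all the measurement hardware required.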

Time-to-digital converters exploit propagation delay more precisely. A tapped delay line, a chain of elements where each tap captures the signal after one additional gate delay, can time events with a resolution of about 10 picoseconds. This is used in laser ranging, particle physics, and slope ADCs. The technique works because you're not measuring time digitally (counting clock cycles) but exploiting the analog propagation characteristics of the silicon itself.
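A tapped delay line produces a thermometer code: a run of ones for every tap the edge has already passed when the clock samples the line. The sketch below models that sampling and decoding with integer picoseconds. The uniform 10 ps tap delay, the 32-tap length, and the function names are illustrative assumptions; real taps are uneven and need calibration.

```python
def sample_delay_line(elapsed_ps, n_taps, tap_delay_ps):
    """Bits captured at the clock edge. Tap i reads 1 if the input edge,
    which arrived elapsed_ps before the clock edge, has passed tap i."""
    return [1 if (i + 1) * tap_delay_ps <= elapsed_ps else 0
            for i in range(n_taps)]


def decode_thermometer(taps, tap_delay_ps):
    """Convert the captured bits back into a time estimate.
    Counting ones (rather than finding the first zero) tolerates
    isolated out-of-order bits in the captured code."""
    return sum(taps) * tap_delay_ps


taps = sample_delay_line(elapsed_ps=137, n_taps=32, tap_delay_ps=10)
print(decode_thermometer(taps, 10))   # 130
```

The estimate is quantized to one tap delay, so in this model the resolution is exactly the tap delay. A coarse clock-cycle counter would typically supply the higher-order bits of the timestamp.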

Digital-to-analog conversion is surprisingly capable: a delta-sigma modulator with a low-pass filter can reach 10-12 bit resolution. Push further — use an output serializer running at 600 MHz — and you can transmit FM radio from a single I/O pin. Use the multi-gigabit transceivers and you reach RF frequencies suitable for software-defined radio.⁴
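A first-order delta-sigma modulator can be as small as an accumulator whose carry-out drives the pin. The sketch below models that structure in Python and treats the low-pass filter as a plain average of the bitstream. The function name and the 8-bit width are assumptions, and this is the simplest modulator form, not necessarily the one used in the cited designs.

```python
def delta_sigma(value, bits, n_cycles):
    """First-order modulator: add the input to an accumulator every clock
    and output the carry bit. The density of ones tracks value / 2**bits."""
    mask = (1 << bits) - 1
    acc = 0
    stream = []
    for _ in range(n_cycles):
        acc += value
        stream.append(acc >> bits)   # carry out of the accumulator (0 or 1)
        acc &= mask                  # keep only the low `bits` bits
    return stream


stream = delta_sigma(value=64, bits=8, n_cycles=256)
print(sum(stream) / len(stream))     # 0.25, i.e. 64 / 256
```

Over a full period of 2**bits clocks, the number of ones equals the input value exactly; the analog filter on the pin does the averaging that `sum(...) / len(...)` does here.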

Physical unclonable functions (PUFs) are perhaps the most elegant application: a circuit that produces a unique ID for each physical chip, like a silicon fingerprint. The ID must be stable across temperature and voltage changes but vary between chips — resistant to environmental variation, sensitive to manufacturing variation. If reliable, a PUF provides a cryptographic key source that requires no battery backup and can't be extracted by reading memory, because the key is an emergent property of the specific silicon, not stored anywhere.⁴
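One common way to build such a fingerprint is to compare pairs of nominally identical ring oscillators and record which one of each pair is faster. The sketch below is a toy model of that idea only: the chip is simulated with seeded random offsets, and environmental change is idealized as a uniform scaling of every frequency. All names and numbers are hypothetical, and real PUFs need error correction precisely because drift is not this well behaved.

```python
import random


def make_chip(seed, n_osc=128, nominal_hz=100e6, spread=0.01):
    """Hypothetical chip: each oscillator gets a fixed random frequency
    offset that stands in for manufacturing variation."""
    rng = random.Random(seed)
    return [nominal_hz * (1 + rng.gauss(0, spread)) for _ in range(n_osc)]


def read_puf(chip, env_scale=1.0):
    """One bit per neighbouring oscillator pair: which one is faster?
    env_scale models temperature or voltage as a common shift applied
    to every oscillator (an idealization)."""
    freqs = [f * env_scale for f in chip]
    return [int(freqs[i] > freqs[i + 1]) for i in range(0, len(freqs) - 1, 2)]


chip_a = make_chip(seed=1)
chip_b = make_chip(seed=2)
same_chip_stable = read_puf(chip_a, 0.9) == read_puf(chip_a, 1.1)
chips_differ = read_puf(chip_a) != read_puf(chip_b)
```

Using comparisons rather than absolute frequencies is the design choice that matters: a shift common to both oscillators in a pair cancels out, while the manufacturing difference between them does not.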

Tiny GPUs and the Demoscene Connection

The demoscene tradition of creative constraint reaches its logical endpoint when you're not just optimizing software for fixed hardware but designing the hardware itself. The DMC-1 is a GPU designed for 1990s-style games (Doom, Comanche, Quake), implemented on a Lattice iCE40 UP5K with roughly 5,000 lookup tables (5,280 in the device) and a 25 MHz clock. It renders perspective-correct textured polygons, terrains, and lightmaps, producing each screen column in real time without a frame buffer.⁵

The key algorithmic insight is avoiding per-pixel division for perspective-correct texturing. Division is expensive in hardware: either slow (many cycles) or large (many LUTs). For vertical walls and horizontal floors, the division can be precomputed in a small table because the divisor is bounded by the screen dimensions. For arbitrary-angle surfaces, the divisor turns out to be the dot product of two unit vectors (the plane normal and the view ray), which is bounded to [-1, 1] — also small enough for a table. This single trick enables perspective-correct texturing on arbitrary polygons without any division hardware.⁵
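The general pattern of replacing a divider with a small reciprocal table can be sketched as follows. This is an illustration of the technique, not the DMC-1's actual implementation: the 16-bit fixed-point format, the divisor bound of 256, and the function name are all assumptions.

```python
FRAC_BITS = 16        # fixed-point fraction bits (assumed)
MAX_DIVISOR = 256     # hypothetical bound, e.g. derived from screen height

# Precomputed once; in hardware this would sit in a small ROM or block RAM.
# Entry 0 is a placeholder because division by zero is undefined.
RECIP = [0] + [(1 << FRAC_BITS) // d for d in range(1, MAX_DIVISOR + 1)]


def divide_by_table(numerator, divisor):
    """Approximate numerator // divisor with one multiply and one shift."""
    return (numerator * RECIP[divisor]) >> FRAC_BITS


print(divide_by_table(1000, 8))   # 125
print(divide_by_table(21, 7))     # 2 (exact answer is 3; see note below)
```

Because each table entry is rounded down, the result can be one less than the exact quotient. Whether that matters depends on the application; for texture coordinates an off-by-one texel is often invisible, and extra fraction bits reduce the error further.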

Another group took a different approach: convert a C++ ray tracer directly into digital logic using PipelineC and CflexHDL, which parse C code and automatically generate pipelined hardware. On a Xilinx Artix-7, the system ray-traces at 1080p 60FPS with a 400-stage pipeline, achieving 70 GFLOP/s at under 1 watt — about 50x more power-efficient than a modern CPU running the same algorithm. The entire flow from C code to FPGA bitstream uses exclusively open-source tools.²

Both projects connect to the same impulse that drives the demoscene: if you understand the hardware deeply enough — or if you're willing to build the hardware — you can achieve things that seem impossible within the constraints. The DMC-1 renders Quake levels on a chip that costs a few dollars and uses less power than an LED. The PipelineC raytracer achieves GPU-class performance without a GPU. The constraint isn't "make it small" or "make it fast" but "make it exist where it has no right to."

Bus Protocols and Building SoCs

The gap between "I can blink an LED" and "I have a working computer on an FPGA" is mostly filled by bus protocols. AXI (Advanced eXtensible Interface), ARM's bus standard, is the lingua franca of modern FPGA SoC design — and learning it properly matters because the most widely available examples are broken.

Dan Gisselquist of ZipCPU has documented this with barely contained exasperation: Xilinx's example AXI Stream master violates the handshaking specification. Their AXI-lite slave template mishandles backpressure. Their full AXI slave doesn't implement addressing correctly. These are the designs that Xilinx's own training materials tell you to start from, and the bugs are fundamental enough that formal verification catches them immediately.⁶

The core of AXI is the handshake: a VALID signal from the sender, a READY signal from the receiver, and a transfer occurs when both are asserted on the same clock edge. The critical subtlety is that VALID must not depend on READY: a sender must be willing to assert that data is available before knowing the receiver is ready to accept it. Violating this creates combinational loops or deadlocks, which is exactly what Xilinx's examples do. The fix is the skidbuffer, a one-entry FIFO that decouples the two sides, allowing both to operate at full throughput without a combinational dependency between them. Without a skidbuffer, a component that registers its handshake outputs can't exceed 50% throughput without violating the spec.⁶
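The behavior of a skid buffer is easier to follow in a cycle-by-cycle model. The sketch below is a Python simulation of one common form, in which the upstream READY comes straight from a register while VALID and data pass through combinationally when the buffer is empty. Class, method, and signal names are assumptions chosen for illustration, not taken from any specific implementation.

```python
class SkidBuffer:
    def __init__(self):
        self.r_valid = False    # True while a stashed beat is waiting
        self.r_data = None

    def outputs(self, i_valid, i_data):
        """Combinational outputs for the current cycle."""
        o_ready = not self.r_valid              # depends only on our register
        o_valid = i_valid or self.r_valid
        o_data = self.r_data if self.r_valid else i_data
        return o_ready, o_valid, o_data

    def clock(self, i_valid, i_data, i_ready):
        """Register update on the clock edge."""
        o_ready = not self.r_valid
        if i_valid and o_ready and not i_ready:
            # We promised READY upstream but downstream stalled: stash it.
            self.r_valid, self.r_data = True, i_data
        elif i_ready:
            self.r_valid = False                # stashed beat (if any) left


def run(items, ready_pattern):
    """Push items through the buffer; ready_pattern is the sink's READY
    value on each cycle. Returns the beats the sink accepted, in order."""
    buf, idx, received = SkidBuffer(), 0, []
    for i_ready in ready_pattern:
        i_valid = idx < len(items)
        i_data = items[idx] if i_valid else None
        o_ready, o_valid, o_data = buf.outputs(i_valid, i_data)
        if o_valid and i_ready:
            received.append(o_data)             # downstream transfer
        buf.clock(i_valid, i_data, i_ready)
        if i_valid and o_ready:
            idx += 1                            # upstream transfer
    return received


print(run([0, 1, 2, 3, 4], [True] * 5))
# [0, 1, 2, 3, 4]: one beat per cycle when the sink never stalls
```

The sender never looks at the sink's READY, and the buffer's own READY never depends combinationally on it either. When the sink stalls unexpectedly, the one beat already in flight "skids" into the register instead of being lost.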

For most designs, AXI-Lite is sufficient. The jump from AXI-Lite to full AXI buys you burst transfers (multiple data beats per address handshake), but for slaves the throughput difference is trivial (maybe 2 clock cycles of latency per transaction). For masters, bursts matter more, especially for DMA engines that need to blast data across the bus. The addressing is where things get genuinely hard: the FIXED, WRAP, and INCR burst types interact with the AxSIZE field in ways that even ASIC designers frequently get wrong.
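The per-beat address rules can be written out compactly. The sketch below follows the address equations as described in the AXI specification (bytes per beat is 2**AxSIZE, beats per burst is AxLEN + 1), but it is a simplified illustration: the function name is hypothetical, and it ignores constraints such as the 4 KB boundary rule and per-type length limits.

```python
def burst_addresses(start, axsize, axlen, burst):
    """Return the address of each beat in a burst.
    burst is one of "FIXED", "INCR", "WRAP"."""
    nbytes = 1 << axsize                 # bytes per beat
    beats = axlen + 1                    # AxLEN encodes length minus one
    aligned = (start // nbytes) * nbytes
    if burst == "FIXED":
        return [start] * beats           # same address every beat (FIFO-style)
    if burst == "INCR":
        # The first beat may be unaligned; later beats are aligned to the size.
        return [start] + [aligned + n * nbytes for n in range(1, beats)]
    if burst == "WRAP":
        if beats not in (2, 4, 8, 16):
            raise ValueError("WRAP bursts must be 2, 4, 8, or 16 beats")
        if start != aligned:
            raise ValueError("WRAP start address must be aligned to the size")
        span = nbytes * beats            # total bytes covered by the burst
        lower = (start // span) * span   # wrap boundary
        return [lower + (start - lower + n * nbytes) % span
                for n in range(beats)]
    raise ValueError("unknown burst type: " + str(burst))


print([hex(a) for a in burst_addresses(0x1002, 2, 3, "INCR")])
# ['0x1002', '0x1004', '0x1008', '0x100c']
print([hex(a) for a in burst_addresses(0x38, 2, 3, "WRAP")])
# ['0x38', '0x3c', '0x30', '0x34']
```

The unaligned INCR case shows the kind of interaction that trips people up: the second beat is not at start plus four but at the next size-aligned address.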

Gisselquist's proposed course on formally verifying AXI components reveals the real structure of the protocol: you start with handshaking, add counters for outstanding requests (AXI-Lite), add burst-length tracking (AXI full), then handle write channel synchronization (address and data arrive on separate channels), exclusive access (atomic operations), and finally the FIFO challenge — verifying components that contain FIFOs, which are simple in isolation but dramatically complicate inductive proofs. The progression is a ladder of increasing complexity where each rung requires the one below it.

Building an actual SoC from these pieces is where Python HDL frameworks like Migen (and its successor Amaranth) and LiteX shine. whitequark's tutorial on implementing a simple SoC in Migen demonstrates the mechanics: wrap an OpenRISC CPU core, write GPIO and UART peripherals as Migen modules, wire them together through a Wishbone bus interconnect, compile the whole thing with Yosys and IceStorm, and load it onto a $20 iCE40 board. The Wishbone bus — AXI's simpler, older cousin — uses a straightforward request-acknowledge handshake that's much easier to implement correctly, though it lacks AXI's burst capabilities.⁷

The real insight from SoC-building tutorials isn't the bus protocol or the peripheral design — it's that a computer is shockingly few components. A CPU core, a bus, a ROM, a UART, and some GPIO. Wire them together correctly and you have a machine that runs C programs. The open hardware movement has driven this understanding from the province of chip companies into the hands of hobbyists, and bus protocol knowledge is the practical skill that makes the difference between reading about SoC design and actually doing it.

The IEEE 754 Story

The importance of getting hardware right shows up starkly in the history of floating-point arithmetic. Before IEEE 754, every computer line had its own floating-point format with its own rounding behavior. Some machines had numbers that behaved as nonzero during comparison but as zero during division. Multiplying by 1.0 could truncate your last four bits. You could get zero from X - Y even when X and Y were huge and different. Programmers coped with bizarre incantations like X = (X + X) - X inserted at critical points to work around specific hardware quirks.⁸

William Kahan, a Berkeley professor who had analyzed commercially significant arithmetic anomalies, was recruited by Intel to design the arithmetic for the i8087 coprocessor. The key insight was that Intel expected to sell vast numbers of these chips, so "best" meant "best for a market much broader than anyone else contemplated" — including the many programmers who had never taken a numerical analysis class. Every design decision had to serve both naive users (who needed sane defaults) and experts (who needed portable proofs of correctness).⁸

The most contentious feature was gradual underflow: subnormal numbers that fill the gap between zero and the smallest normalized float. Without them, there's a chasm many orders of magnitude wide where numbers suddenly flush to zero, causing subtle failures in code that was never designed to detect them. Kahan had a method to implement gradual underflow without slowing down normal operations, but Intel's confidentiality obligations prevented him from disclosing it to the standards committee. The battle over gradual underflow lasted years, with DEC leading the opposition, until DEC itself commissioned a respected error analyst to assess it, and he concluded it was the right thing to do.⁸
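What gradual underflow buys can be seen with ordinary double-precision floats. The sketch below assumes Python's `float` is an IEEE 754 double with subnormals enabled, which is the case on mainstream platforms; `flush_to_zero` is a hypothetical helper that imitates hardware lacking subnormals.

```python
import sys

smallest_normal = sys.float_info.min      # 2**-1022 for IEEE 754 doubles

x = 1.5 * smallest_normal
y = smallest_normal
diff = x - y                              # 2**-1023: below the normal range


def flush_to_zero(v):
    """Imitate arithmetic without gradual underflow (hypothetical helper)."""
    return 0.0 if abs(v) < smallest_normal else v


print(x != y, diff != 0.0)                # True True
print(flush_to_zero(diff) == 0.0)         # True
```

With subnormals, `x - y == 0` implies `x == y`, so a guard like `if x != y: z = 1 / (x - y)` does not hit a zero divisor. Without them, two distinct numbers can have a difference of exactly zero, which is the kind of subtle failure the paragraph above describes.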

The IEEE 754 standard was adopted by Intel, AMD, Motorola, National Semiconductor, Zilog, and many others even before it was officially canonized — it became a de facto standard before becoming a de jure one. Kahan received the Turing Award in 1989. The standard's interlocking design is difficult to implement correctly, which is part of why it succeeded: once you invest the effort to conform, the payoff in portable numerical software is enormous. It's one of the rare standards where, as Kahan put it, "sleaze did not triumph."

¹ FPGA Design for Software Engineers, Part 1, by walknsqualk.com
² 3D raytraced game with open source C to FPGA toolchain, by Victor Suarez Rovere and Julian Kemmerer
³ A Bug-Free RISC-V Core without Simulation, by Tom Verbeure
⁴ Unconventional uses of FPGAs, by AlexLao512
⁵ tinygpus/README.md, by Sylvain Lefebvre
⁶ Learning AXI: Where to start?, by ZipCPU
⁷ Implementing a simple SoC in Migen, by whitequark
⁸ An Interview with the Old Man of Floating-Point, by Charles Severance
