GPU Pipeline Architecture

A GPU is not a faster CPU. It's a different kind of machine — one designed around the assumption that you have thousands of independent things to do at once. Understanding this is the price of admission for modern graphics, and increasingly for compute. What's surprising is how much of the GPU's architecture is dedicated not to computation but to work distribution — the logistics of keeping thousands of cores busy despite wildly variable workloads.

The Life of a Triangle

NVIDIA's Christoph Kubisch wrote one of the clearest explanations of how geometry actually flows through a modern GPU, and it's worth internalizing because the mental model most programmers have — a simple pipeline where vertices go in and pixels come out — is dangerously incomplete.¹

The GPU is partitioned into Graphics Processing Clusters (GPCs), each containing Streaming Multiprocessors (SMs) and a raster engine. An SM contains the cores that actually do math, but the cores are dumb: they execute instructions dispatched by warp schedulers. A warp is a group of 32 threads that execute in lockstep. The cores don't decide what to run; the scheduler does. This is the SIMT model that shader programming is built on, and it has a specific consequence: divergence is expensive. When threads in a warp take different branches, the warp executes both paths one after the other, and the threads that did not take the current path are masked off while it runs.
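The cost of divergence can be made concrete with a toy model. The sketch below is not how any real scheduler is implemented; it only counts how many instruction issue slots one 32-thread warp spends on an if/else, assuming each side of the branch is issued once for the whole warp whenever at least one thread needs it. The function name and the cost parameters are made up for illustration.

```python
WARP_SIZE = 32

def warp_issue_slots(takes_branch, then_cost, else_cost):
    """Toy SIMT model of one warp executing an if/else.

    takes_branch: list of 32 booleans, one per thread.
    then_cost / else_cost: instruction counts of each side (both > 0).
    Returns (issued instructions, lane utilization in [0, 1]).
    """
    assert len(takes_branch) == WARP_SIZE
    any_then = any(takes_branch)
    any_else = not all(takes_branch)
    # Each side is issued once for the whole warp if any thread needs it.
    issued = (then_cost if any_then else 0) + (else_cost if any_else else 0)
    # Useful work is what each lane actually needed, summed over lanes.
    useful = sum(then_cost if t else else_cost for t in takes_branch)
    return issued, useful / (issued * WARP_SIZE)

uniform = [True] * WARP_SIZE
divergent = [True] + [False] * (WARP_SIZE - 1)
print(warp_issue_slots(uniform, 10, 10))    # one path issued
print(warp_issue_slots(divergent, 10, 10))  # both paths issued
```

In this model a single stray thread doubles the issued instructions for the whole warp, which is why shader authors try to keep branches coherent across neighboring threads.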

A triangle's journey starts with the CPU issuing a draw call that reaches the GPU's front end via a push buffer. The Primitive Distributor fans out triangle batches to GPCs. Within a GPC, the PolyMorph Engine fetches vertex data, warps run the vertex shader, and then comes the critical part: the transformed triangle, not yet rasterized, gets sent through a Work Distribution Crossbar to whichever GPC's raster engine covers the relevant screen tiles. A triangle can leave the GPC that processed its vertices. The raster engine chops it into 2x2 pixel quads, the smallest unit of pixel-shading work, and these quads run through the pixel shader in warps.¹

That 2x2 quad is important. It exists because the GPU needs to compute screen-space derivatives for texture filtering — how fast UV coordinates change between adjacent pixels determines which mipmap level to sample.² Helper invocations (pixels in the quad that fall outside the triangle) still execute the shader but discard their results. This is why very small triangles are catastrophic for performance: a triangle covering a single pixel still occupies a full 2x2 quad, wasting 75% of the computation.
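Both halves of that argument fit in a few lines. The following is a rough sketch, not a description of any specific GPU: `mip_level` uses the textbook finite-difference formula on the UVs of a pixel and its right and lower quad neighbors (real hardware uses its own approximations), and `quad_efficiency` counts how many of the shaded lanes in the touched 2x2 quads were actually covered. Both function names are invented for this example.

```python
import math

def mip_level(uv, uv_right, uv_down, tex_size):
    """Estimate the mip level from UV differences inside one 2x2 quad."""
    du_dx = (uv_right[0] - uv[0]) * tex_size
    dv_dx = (uv_right[1] - uv[1]) * tex_size
    du_dy = (uv_down[0] - uv[0]) * tex_size
    dv_dy = (uv_down[1] - uv[1]) * tex_size
    # rho = texels stepped per pixel along the faster screen axis.
    rho = max(math.hypot(du_dx, dv_dx), math.hypot(du_dy, dv_dy))
    return max(0.0, math.log2(rho)) if rho > 0 else 0.0

def quad_efficiency(covered_pixels):
    """Fraction of shaded lanes that were real (non-helper) pixels."""
    covered = set(covered_pixels)
    # Every touched 2x2 quad is shaded in full, helpers included.
    quads = {(x // 2, y // 2) for x, y in covered}
    return len(covered) / (4 * len(quads)) if quads else 0.0

# Two texels per pixel on a 256-texel texture selects mip 1.
print(mip_level((0.0, 0.0), (2 / 256, 0.0), (0.0, 2 / 256), 256))
# A one-pixel triangle keeps one lane in four busy.
print(quad_efficiency([(5, 5)]))
```

Note that a 2x2 block of pixels that straddles four quads is just as wasteful as a single pixel, so alignment matters as well as size.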

The Small Triangle Problem

This is the central tension in modern real-time graphics. Art direction has pushed toward ever-denser geometry — photogrammetry assets can have millions of triangles per object. But GPUs were designed for the opposite scenario: comparatively few triangles, each covering many pixels. The hardware rasterizer has tile binning, quad shading, and vertex caching optimizations that all assume this. Feed it millions of pixel-sized triangles and you hit several bottlenecks at once: the primitive assembler can't keep up, quad utilization collapses, and vertex caching becomes useless.³

The Cities: Skylines 2 debacle is an instructive cautionary tale. Paavo Huhtala's RenderDoc analysis revealed that the game was rendering absurdly overdetailed models (characters with 4000-triangle teeth meshes visible only as a few pixels, foliage rendered at full detail at every distance), and the GPU simply couldn't keep up. The game ran worse than Cyberpunk 2077 with path tracing enabled, not because its lighting was more sophisticated, but because its geometry budget was incoherent. The rendering itself was competent Unity HDRP with clustered deferred shading; the problem was upstream, in the art pipeline's failure to manage triangle budgets.⁴

Nanite and the Visibility Buffer

Epic's Nanite system in Unreal Engine 5 is the most ambitious production solution to the small triangle problem. Eduardo Lopez Rodriguez reverse-engineered a frame using RenderDoc and the result reads like a masterclass in GPU-driven rendering.⁵

Nanite's key insight is to decouple geometry from materials using a visibility buffer — a single texture that stores cluster ID, triangle ID, and depth for each pixel. No materials are evaluated during rasterization. No textures are sampled. The entire geometry pass just figures out which triangle is closest at each pixel. Material shading happens later in a separate pass that reads the visibility buffer and evaluates the appropriate material for each pixel exactly once. If deferred rendering decouples materials from lighting, the visibility buffer decouples geometry from materials.
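A minimal sketch of the idea, under stated assumptions: each pixel holds one 64-bit value with depth in the high 32 bits, followed by a 25-bit cluster ID and a 7-bit triangle ID (7 bits is enough for the 128-triangle clusters described in this section). The exact bit layout, the "larger depth value means closer" convention, and all function names here are assumptions for illustration, not a statement of Epic's implementation.

```python
CLUSTER_BITS = 25   # assumed width of the cluster ID field
TRIANGLE_BITS = 7   # 2**7 = 128 triangles per cluster
ID_BITS = CLUSTER_BITS + TRIANGLE_BITS

def pack_visibility(depth, cluster_id, triangle_id):
    """Pack one visibility sample; depth sits in the high bits so that
    comparing packed values compares depth first."""
    assert 0 <= depth < 1 << 32
    assert 0 <= cluster_id < 1 << CLUSTER_BITS
    assert 0 <= triangle_id < 1 << TRIANGLE_BITS
    return (depth << ID_BITS) | (cluster_id << TRIANGLE_BITS) | triangle_id

def unpack_visibility(packed):
    triangle_id = packed & ((1 << TRIANGLE_BITS) - 1)
    cluster_id = (packed >> TRIANGLE_BITS) & ((1 << CLUSTER_BITS) - 1)
    return packed >> ID_BITS, cluster_id, triangle_id

def resolve_materials(vbuffer, cluster_material, shade):
    """Material pass: one shade() call per pixel, after all geometry.

    vbuffer: dict mapping (x, y) to a packed sample.
    cluster_material: dict mapping cluster ID to a material handle.
    shade: callable(material, cluster_id, triangle_id) returning a color.
    """
    image = {}
    for pixel, packed in vbuffer.items():
        _, cluster_id, triangle_id = unpack_visibility(packed)
        material = cluster_material[cluster_id]
        image[pixel] = shade(material, cluster_id, triangle_id)
    return image
```

Because overdraw is settled before `resolve_materials` runs, the number of material evaluations equals the number of visible pixels, regardless of how many triangles were rasterized behind them.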

The rasterization itself is split: large triangles use the hardware rasterizer, but over 90% of the geometry — the millions of tiny triangles — goes through a compute shader software rasterizer. This is the counterintuitive part: for small triangles, bypassing the GPU's dedicated rasterization hardware and doing it manually in a compute shader is faster, because you avoid all the quad-shading overhead. Nanite groups triangles into 128-triangle clusters, culls aggressively using a hierarchical depth buffer from the previous frame, and dynamically selects LOD levels so that clusters seamlessly stitch together at different detail levels.⁵

JMS55's implementation of virtual geometry in the Bevy engine is a fascinating open-source parallel to Nanite. Their software rasterizer uses the same basic approach — screen-space AABB to decide software vs. hardware raster, atomicMax for depth, compute dispatch per cluster — but documents every decision and tradeoff publicly.³ Reading both together gives you a much clearer picture than either alone: Nanite tells you what's possible, Bevy tells you what's hard.
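The two ingredients named above, the AABB size test and the atomicMax depth write, can be sketched on the CPU as follows. This is an illustration only, not code from Nanite or Bevy: the 16-pixel threshold is a made-up value, depth is treated as constant per triangle, larger depth is assumed to mean closer, no top-left fill rule is applied, and a Python `max` stands in for the 64-bit atomic that a compute shader would use.

```python
import math

def use_software_raster(tri, max_extent=16):
    """Pick the software path when the screen-space AABB is small.
    max_extent is a hypothetical threshold in pixels."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return (max(xs) - min(xs)) <= max_extent and (max(ys) - min(ys)) <= max_extent

def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge a->b the point p lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize_small_triangle(vbuffer, tri, depth, payload):
    """Scan the triangle's AABB, test pixel centers, keep the closest hit.
    vbuffer: dict (x, y) -> (depth << 32 | payload). Returns pixels covered."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return 0  # degenerate triangle covers nothing
    key = (depth << 32) | payload
    covered = 0
    for y in range(math.floor(min(y0, y1, y2)), math.floor(max(y0, y1, y2)) + 1):
        for x in range(math.floor(min(x0, x1, x2)), math.floor(max(x0, x1, x2)) + 1):
            px, py = x + 0.5, y + 0.5
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            # Accept either winding by comparing signs against the area.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) if area > 0 else (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered += 1
                # Stand-in for a 64-bit atomicMax on the visibility buffer.
                vbuffer[(x, y)] = max(vbuffer.get((x, y), 0), key)
    return covered
```

There is no quad in this loop at all: each covered pixel costs one compare-and-write, which is the whole appeal for pixel-sized triangles.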

Clustered Forward Rendering

While Nanite represents the bleeding edge, most shipped games use variations of clustered forward or deferred rendering. The DOOM 2016 graphics study by Adrian Courreges remains one of the best frame breakdowns ever written, and its explanation of clustered rendering is worth studying.⁶

The idea is spatial: divide the camera frustum into a 3D grid of "froxels" (frustum voxels). DOOM uses 16x8 tiles with 24 logarithmic depth slices, creating 3072 clusters. The CPU assigns each light, decal, and cubemap probe to the clusters its volume intersects. During shading, a pixel shader determines which cluster it belongs to from its screen position and depth, looks up the list of affecting lights, and loops over them. The overhead of the CPU-side cluster assignment is small compared to the GPU savings from not evaluating lights that can't possibly affect a given pixel.
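The per-pixel cluster lookup can be sketched as below. The 16x8x24 grid comes from the text; the logarithmic slice formula is the common textbook form and the near and far plane values in the usage lines are placeholders, so treat this as an approximation of the scheme rather than id Software's exact math.

```python
import math

TILES_X, TILES_Y, SLICES = 16, 8, 24  # 16 * 8 * 24 = 3072 clusters

def cluster_index(px, py, view_z, width, height, near, far):
    """Map a pixel and its view-space depth to a flat froxel index."""
    tx = min(int(px * TILES_X / width), TILES_X - 1)
    ty = min(int(py * TILES_Y / height), TILES_Y - 1)
    # Logarithmic slicing: thin slices near the camera, thick ones far away.
    z = min(max(view_z, near), far)
    s = int(SLICES * math.log(z / near) / math.log(far / near))
    s = min(s, SLICES - 1)
    return (s * TILES_Y + ty) * TILES_X + tx

# Placeholder resolution and clip planes, chosen only for the example.
print(cluster_index(0, 0, 0.1, 1920, 1080, near=0.1, far=1000.0))
print(cluster_index(1919, 1079, 1000.0, 1920, 1080, near=0.1, far=1000.0))
```

The shader then uses this index to fetch the cluster's light, decal, and probe lists and loops only over those entries.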

DOOM Eternal pushed this further.⁷ With hundreds of dynamic lights and thousands of decals, id Software replaced the CPU clustered culling with a GPU software rasterizer for lights and decals — projecting their bounding hexahedra into screen space using compute shaders, binning them into tiles, and depth-testing against the scene. They also eliminated the thin G-Buffer from DOOM 2016, moving screen-space reflections directly into the forward uber shader. Everything in the engine runs through a tiny set of massive uber shaders with a fully bindless resource architecture — all textures bound at once, indexed dynamically. This enables draw call merging across unrelated meshes: a compute shader creates a single indirect draw call from geometry that shares a shader even if the materials differ.

The whole engine has about 500 pipeline states total. Compare that to a typical Unreal Engine game with thousands of shader permutations. id Tech's approach trades artistic flexibility for raw performance, and at 60fps+ on modest hardware while looking gorgeous, the tradeoff works.

GPU Memory: Five Ways to Expose the Same RAM

The Vulkan memory model exposes a surprising amount of hardware reality — and some awkward virtualization. Adam Sawicki, developer of the Vulkan Memory Allocator library, documented five distinct patterns that GPUs expose for memory heaps and types, and the differences are instructive for understanding why "just allocate a buffer" is never simple on modern GPUs.⁸

Intel integrated graphics has the simplest model: one heap, all memory both DEVICE_LOCAL and HOST_VISIBLE. The GPU and CPU share the same physical RAM, so you can write directly to GPU-accessible memory via a mapped pointer. No staging copies, no transfer commands for buffers. NVIDIA discrete cards show the classic split: DEVICE_LOCAL video RAM (fast for GPU, not CPU-accessible) and HOST_VISIBLE system RAM (CPU-accessible, slow for GPU). Loading a texture means writing to a staging buffer in system RAM, then issuing a vkCmdCopyBufferToImage to move it to video RAM.

AMD discrete cards add a third heap: 256 MB of memory that's both DEVICE_LOCAL and HOST_VISIBLE. This is the window of video RAM exposed through a PCIe Base Address Register (BAR), which the CPU can write to directly over PCIe. It is ideal for per-frame uniform buffers that the CPU writes and the GPU reads once. Resizable BAR, a PCIe feature that AMD markets as "Smart Access Memory", extends this window to the entire video RAM, eliminating the 256 MB limit. And then there's AMD's integrated graphics (APUs), which virtualize the shared physical RAM into disjoint heaps with only 256 MB marked DEVICE_LOCAL. That is the worst case for any engine that tries to fill DEVICE_LOCAL memory first, because 256 MB won't fit even basic render targets.⁸

The practical consequence is that any GPU memory allocator must identify which pattern it's running on and adjust its strategy accordingly. A naive "prefer DEVICE_LOCAL" policy works on discrete cards but fails on APUs. A strategy that always uses staging copies wastes time and memory on integrated GPUs where direct access is fastest. And the vendor-specific flags (like AMD's DEVICE_COHERENT and DEVICE_UNCACHED for crash debugging) add memory types that the validator will reject if you use them without enabling the corresponding extension — even though the driver reports them unconditionally.
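As a sketch of what "adjusting the strategy" looks like at the lowest level, here is the usual memory-type search written in Python for readability. The flag values match the Vulkan `VK_MEMORY_PROPERTY_*` bits, but the three layouts are simplified stand-ins for the patterns described above, not real driver dumps, and the function ignores heap sizes and budgets, which a real allocator must also check.

```python
DEVICE_LOCAL = 0x1   # VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
HOST_VISIBLE = 0x2   # VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
HOST_COHERENT = 0x4  # VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

def find_memory_type(memory_types, type_bits, required, preferred=0):
    """Return the first allowed type with all required flags, favoring
    one that also has the preferred flags. None if nothing fits.

    memory_types: list of property-flag ints, indexed by memory type.
    type_bits: bitmask of allowed types (as in memoryTypeBits).
    """
    fallback = None
    for index, flags in enumerate(memory_types):
        if not (type_bits >> index) & 1:
            continue  # the resource cannot live in this type
        if flags & required != required:
            continue
        if flags & preferred == preferred:
            return index
        if fallback is None:
            fallback = index
    return fallback

# Simplified, hypothetical layouts for the patterns in the text.
nvidia_like = [DEVICE_LOCAL, HOST_VISIBLE | HOST_COHERENT]
amd_like = nvidia_like + [DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT]
intel_like = [DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT]

# Per-frame uniforms: must be mappable, ideally also in video RAM.
for layout in (nvidia_like, amd_like, intel_like):
    print(find_memory_type(layout, 0b111, HOST_VISIBLE, DEVICE_LOCAL))
```

The same request lands in system RAM on the split layout, in the BAR window on the three-heap layout, and in the single shared type on the unified layout, which is the behavior the paragraph above argues for.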

What This Architecture Means

The GPU pipeline is optimized for a specific kind of work: uniform, parallel, bandwidth-limited computation. Everything that makes GPUs fast (warp execution, quad shading, texture caching, register-file switching for latency hiding) falls apart when the workload becomes irregular. Small triangles, divergent shading, random memory access patterns: these all push against the architecture rather than working with it.

The most interesting trend in real-time graphics is the shift from hardware-driven pipelines to GPU-driven ones. Instead of the CPU setting up draw calls and the hardware rasterizer processing them, modern engines use compute shaders for culling, LOD selection, draw call generation, and even rasterization itself. The GPU becomes the orchestrator of its own work, and the CPU's role shrinks to uploading data and issuing a few high-level dispatches. This is what Nanite does, what DOOM Eternal does, and what will increasingly become the baseline expectation for production-quality rendering.
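A CPU-side sketch of GPU-driven draw generation may help fix the idea. On real hardware each instance would be handled by one compute thread and survivors appended through an atomic counter; here a plain loop plays that role. The record fields mirror Vulkan's `VkDrawIndexedIndirectCommand`, while the instance dictionary keys, the inward-facing plane convention, and the function names are assumptions made for this example.

```python
from collections import namedtuple

# Field order follows VkDrawIndexedIndirectCommand.
DrawIndexedIndirect = namedtuple(
    "DrawIndexedIndirect",
    "index_count instance_count first_index vertex_offset first_instance",
)

def sphere_visible(center, radius, planes):
    """planes: (nx, ny, nz, d) with unit normals pointing into the frustum.
    The sphere is rejected only if it lies fully behind some plane."""
    cx, cy, cz = center
    return all(nx * cx + ny * cy + nz * cz + d >= -radius
               for nx, ny, nz, d in planes)

def build_draws(instances, planes):
    """Cull instances and emit one indirect draw record per survivor."""
    draws = []
    for inst in instances:  # one compute thread per instance on a GPU
        if sphere_visible(inst["center"], inst["radius"], planes):
            draws.append(DrawIndexedIndirect(
                inst["index_count"], 1, inst["first_index"],
                inst["vertex_offset"], inst["id"]))
    return draws
```

The CPU never learns which instances survived; it only issues an indirect draw that reads however many records the culling pass produced.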

¹ Life of a triangle - NVIDIA's logical pipeline, by Christoph Kubisch
² Mipmap selection in too much detail, by Pema Malling
³ Virtual Geometry in Bevy 0.15, by JMS55
⁴ Why Cities: Skylines 2 performs poorly, by Paavo Huhtala
⁵ A Macro View of Nanite, by Eduardo Lopez Rodriguez
⁶ DOOM (2016) - Graphics Study, by Adrian Courreges
⁷ DOOM Eternal - Graphics Study, by Simon Coenen
⁸ Vulkan Memory Types on PC and How to Use Them, by Adam Sawicki

