GPU-Driven Rendering

For decades, the CPU told the GPU what to draw. The CPU decided which objects were visible, selected their LOD levels, sorted them, built command buffers, and submitted draw calls. The GPU was the obedient rasterizer. GPU-driven rendering inverts this: the GPU makes the rendering decisions itself, using compute shaders to cull, sort, and dispatch work without round-tripping through the CPU.

The motivation is straightforward. A CPU-driven engine with a million objects would need the CPU to process each one every frame — visibility testing, LOD selection, draw call generation. At some point the CPU becomes the bottleneck, no matter how fast the GPU is. The fix is to move those decisions onto the machine that's already looking at the data.

The Nanite Paradigm

Epic's Nanite in Unreal Engine 5 is the highest-profile implementation of GPU-driven rendering, and Eduardo Lopez Rodriguez's RenderDoc reverse-engineering reveals the engineering in detail.1

The pipeline starts with GPU instance culling. A compute shader tests every instance against the camera frustum and a hierarchical depth buffer (HZB) from the previous frame. Visible instances produce a list of candidate clusters — groups of 128 triangles each. A second culling pass operates at the cluster level, using persistent culling to determine how many clusters survive. The system writes out separate counts for clusters destined for hardware rasterization versus software rasterization, with indirect dispatch arguments that let the GPU fire off both paths without CPU involvement.1
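The instance-level step can be sketched on the CPU as follows. This is a minimal illustration, not Nanite's shader: the `Instance` fields, the bounding-sphere test, the 64-wide group size, and the toy box-shaped frustum are all assumptions made for the example, and the HZB occlusion test is left out. On a GPU each loop iteration would be one compute thread, the append would be an atomic counter, and the final tuple would be written to a buffer that a later indirect dispatch reads.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    center: tuple       # bounding-sphere center (x, y, z)
    radius: float       # bounding-sphere radius
    cluster_count: int  # clusters this instance contributes if visible

def sphere_in_frustum(center, radius, planes):
    """Each plane is (nx, ny, nz, d) with a unit normal pointing into the frustum."""
    for nx, ny, nz, d in planes:
        distance = nx * center[0] + ny * center[1] + nz * center[2] + d
        if distance < -radius:  # sphere lies entirely outside this plane
            return False
    return True

def cull_instances(instances, planes, group_size=64):
    """Return visible instance indices plus (x, y, z) indirect dispatch arguments."""
    visible = []
    candidate_clusters = 0
    for index, inst in enumerate(instances):  # one GPU thread per instance
        if sphere_in_frustum(inst.center, inst.radius, planes):
            visible.append(index)             # atomic append on a GPU
            candidate_clusters += inst.cluster_count
    groups = -(-candidate_clusters // group_size)  # ceiling division
    return visible, (groups, 1, 1)

# Toy "frustum": the axis-aligned cube [-1, 1]^3 (hypothetical test data).
BOX_PLANES = [(1, 0, 0, 1), (-1, 0, 0, 1), (0, 1, 0, 1),
              (0, -1, 0, 1), (0, 0, 1, 1), (0, 0, -1, 1)]
instances = [
    Instance((0.0, 0.0, 0.0), 0.5, 10),   # fully inside
    Instance((5.0, 0.0, 0.0), 1.0, 40),   # fully outside
    Instance((1.5, 0.0, 0.0), 1.0, 100),  # straddles the boundary, kept
]
visible, indirect_args = cull_instances(instances, BOX_PLANES)
```

The important property is that the CPU never sees `visible` or `indirect_args`; the next pass sizes itself from the buffer the culling pass wrote.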

The LOD system is particularly clever. Clusters appear and disappear as the camera moves, with some "magical incantations" ensuring seamless stitching between clusters at different detail levels. The selection happens entirely on the GPU. Occlusion culling stays GPU-side too: the HZB is regenerated mid-frame from current depth data, and a second culling pass re-tests geometry that the previous frame's depth buffer might have incorrectly rejected, such as geometry at the edges of that buffer.1
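A toy version of the two-pass occlusion idea is sketched below. Everything here is an assumption made for illustration: the "HZB" is a single coarse level rather than a full mip chain, objects are flat screen rectangles with one depth value, depth runs from 0 (near) to 1 (far), and the 4x4 screen and object names are invented.

```python
def build_hzb(depth, tile=2):
    """Collapse a depth image (rows of floats) into one coarse level that
    stores the farthest depth of each tile x tile block."""
    return [[max(depth[y + dy][x + dx] for dy in range(tile) for dx in range(tile))
             for x in range(0, len(depth[0]), tile)]
            for y in range(0, len(depth), tile)]

def is_occluded(obj, hzb, tile=2):
    x0, y0, x1, y1 = obj["rect"]  # inclusive pixel bounds
    farthest = max(hzb[ty][tx]
                   for ty in range(y0 // tile, y1 // tile + 1)
                   for tx in range(x0 // tile, x1 // tile + 1))
    return obj["depth"] > farthest  # object is behind everything already there

def rasterize(objects, width=4, height=4):
    depth = [[1.0] * width for _ in range(height)]
    for obj in objects:
        x0, y0, x1, y1 = obj["rect"]
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                depth[y][x] = min(depth[y][x], obj["depth"])
    return depth

def two_pass_cull(objects, prev_hzb):
    # Pass 1: test against last frame's HZB and draw whatever survives.
    pass1 = [o for o in objects if not is_occluded(o, prev_hzb)]
    rejected = [o for o in objects if is_occluded(o, prev_hzb)]
    # Rebuild the HZB from what was just drawn, then retest only the rejects.
    new_hzb = build_hzb(rasterize(pass1))
    pass2 = [o for o in rejected if not is_occluded(o, new_hzb)]
    return pass1, pass2

# Last frame a wall at depth 0.3 covered the whole screen; now it covers the left half.
prev_hzb = [[0.3, 0.3], [0.3, 0.3]]
objects = [
    {"name": "wall",   "rect": (0, 0, 1, 3), "depth": 0.3},
    {"name": "crate",  "rect": (2, 0, 3, 1), "depth": 0.6},  # newly revealed
    {"name": "barrel", "rect": (0, 2, 1, 3), "depth": 0.8},  # still behind the wall
]
pass1, pass2 = two_pass_cull(objects, prev_hzb)
first_pass_names = [o["name"] for o in pass1]
second_pass_names = [o["name"] for o in pass2]
```

The crate is wrongly rejected by stale depth in the first pass and recovered in the second, while the barrel stays culled in both.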

The split between hardware and software rasterization is critical. Over 90% of Nanite's geometry goes through the compute-based software rasterizer, because hardware rasterizers choke on pixel-sized triangles — the 2x2 quad shading model in the GPU pipeline wastes 75% of compute on a single-pixel triangle. The compute rasterizer processes one triangle per thread in 128-thread groups matching the cluster size. The remaining ~10% of clusters — those that are large on screen — still use hardware rasterization, which is faster for bigger triangles.1
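The 75% figure follows from simple counting, which the sketch below reproduces. It assumes an idealized model in which the hardware launches four shader lanes for every 2x2 pixel quad a triangle touches; the function name and the sample coverage sets are invented for the example.

```python
def quad_waste(covered_pixels):
    """Fraction of launched shader lanes that do no useful work, assuming the
    rasterizer shades every 2x2 quad that contains at least one covered pixel."""
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    launched_lanes = 4 * len(quads)
    return 1.0 - len(covered_pixels) / launched_lanes

single_pixel = {(10, 10)}                                    # pixel-sized triangle
big_block = {(x, y) for x in range(8) for y in range(8)}     # quad-aligned 8x8 area

tiny_waste = quad_waste(single_pixel)
block_waste = quad_waste(big_block)
```

A one-pixel triangle occupies one quad, so three of its four lanes are helpers. Large coverage amortizes that cost, which is why the big clusters stay on the hardware path.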

The Visibility Buffer

The output of all this culling and rasterization is a visibility buffer: a single R32G32_UINT texture storing cluster ID (25 bits), triangle ID (7 bits), and 32-bit depth per pixel. No materials. No textures. Just "which triangle is closest at this pixel."1
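One way to picture the 64-bit layout is the packing sketch below. The bit order (depth in the high 32 bits, then cluster ID, then triangle ID) and the convention that larger depth values are nearer are assumptions chosen so that a single max operation doubles as the depth test; the source only gives the field widths. The `max` call stands in for a 64-bit atomic max on the GPU.

```python
import struct

CLUSTER_BITS = 25
TRIANGLE_BITS = 7

def float_to_bits(value):
    """Reinterpret a non-negative float32 as a uint32; ordering is preserved."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def bits_to_float(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def pack_visibility(depth, cluster_id, triangle_id):
    assert depth >= 0.0
    assert 0 <= cluster_id < (1 << CLUSTER_BITS)
    assert 0 <= triangle_id < (1 << TRIANGLE_BITS)
    return (float_to_bits(depth) << 32) | (cluster_id << TRIANGLE_BITS) | triangle_id

def unpack_visibility(word):
    triangle_id = word & ((1 << TRIANGLE_BITS) - 1)
    cluster_id = (word >> TRIANGLE_BITS) & ((1 << CLUSTER_BITS) - 1)
    depth = bits_to_float(word >> 32)
    return depth, cluster_id, triangle_id

def write_pixel(buffer, index, word):
    # Stand-in for a 64-bit atomic max: depth sits in the high bits, so the
    # numerically largest word is also the nearest fragment (assumed convention).
    buffer[index] = max(buffer[index], word)

visibility = [0] * 4  # a 4-pixel "texture", cleared to zero
write_pixel(visibility, 2, pack_visibility(0.25, 1000, 5))
write_pixel(visibility, 2, pack_visibility(0.75, 42, 127))   # nearest, should win
write_pixel(visibility, 2, pack_visibility(0.50, 7, 3))
winner = unpack_visibility(visibility[2])
```

Seven bits address exactly 128 triangles, which matches the cluster size, and 25 + 7 + 32 fills the two 32-bit channels with nothing left over.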

This is a deeper decoupling than deferred shading. Where deferred separates materials from lighting, the visibility buffer separates geometry from materials entirely. Material evaluation happens in a later pass that reads the visibility buffer, reconstructs the triangle's attributes by intersecting the camera ray with the stored triangle, and evaluates the appropriate material. Each pixel's material runs exactly once — no overdraw, no wasted texture fetches for occluded surfaces.1
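The attribute reconstruction step can be sketched with a standard ray-triangle intersection (Möller-Trumbore) that yields barycentric weights, which then interpolate any per-vertex attribute. This is a generic technique swapped in for illustration, not Nanite's shader code; the helper names, the sample triangle, and the UV values are invented.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_barycentrics(origin, direction, v0, v1, v2):
    """Barycentric weights (w0, w1, w2) of the point where the ray meets the
    triangle's plane, or None if the ray is parallel to it."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < 1e-12:
        return None
    inv = 1.0 / det
    t = sub(origin, v0)
    u = dot(t, p) * inv
    v = dot(direction, cross(t, e1)) * inv
    return (1.0 - u - v, u, v)

def interpolate(weights, a0, a1, a2):
    """Blend a per-vertex attribute (any tuple) with barycentric weights."""
    return tuple(weights[0] * x + weights[1] * y + weights[2] * z
                 for x, y, z in zip(a0, a1, a2))

# Triangle fetched via the IDs stored in the visibility buffer (sample data).
v0, v1, v2 = (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)
uv0, uv1, uv2 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)

weights = ray_barycentrics((0.0, 0.0, 0.0), (0.25, 0.25, 1.0), v0, v1, v2)
pixel_uv = interpolate(weights, uv0, uv1, uv2)
```

Because the weights are recomputed per pixel from the triangle itself, the rasterization passes never need to carry UVs, normals, or any other material input.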

The material pass uses another GPU-driven trick: a material classification compute shader analyzes the visibility buffer in 64x64 tiles, determining which materials are present in each tile. Then, for each material, a fullscreen quad split into those tiles is drawn with the material depth texture as an early-Z test. Tiles without that material get their vertices set to NaN and are discarded by the hardware, so only the relevant pixels execute each material's shader.1
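The classification step amounts to building a per-tile set of material IDs, as in the sketch below. The function names and the 4x4 sample image are invented, and the demo uses 2x2 tiles so the result is easy to read; the 64-pixel default mirrors the tile size described above.

```python
def classify_tiles(material_ids, tile=64):
    """Map each (tile_x, tile_y) to the set of material IDs found inside it."""
    tiles = {}
    for y, row in enumerate(material_ids):
        for x, material in enumerate(row):
            tiles.setdefault((x // tile, y // tile), set()).add(material)
    return tiles

def tiles_for_material(tiles, material):
    """Tiles that must be drawn for this material; all others can be discarded."""
    return sorted(t for t, present in tiles.items() if material in present)

# Per-pixel material IDs resolved from the visibility buffer (sample data).
material_ids = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 3, 2, 2],
    [1, 1, 2, 2],
]
tiles = classify_tiles(material_ids, tile=2)
tiles_with_3 = tiles_for_material(tiles, 3)
tiles_with_1 = tiles_for_material(tiles, 1)
```

Material 3 appears in a single tile, so its shader only needs to run over that tile rather than the whole screen.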

Bevy's Open-Source Implementation

JMS55's virtual geometry implementation in the Bevy engine is a public replication of many Nanite ideas, with every decision documented. Their software rasterizer uses the same fundamental approach: screen-space AABB size determines software vs. hardware raster, atomicMax provides depth testing in the compute path, and indirect dispatch coordinates everything without CPU involvement.2
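The raster-path decision can be sketched as a classification over screen-space bounding boxes. The 16-pixel threshold, the function name, and the sample boxes are assumptions for illustration rather than Bevy's actual values. The hardware arguments follow the common (vertex count, instance count, first vertex, first instance) indirect-draw layout, with one instance per cluster.

```python
def classify_clusters(screen_aabbs, max_sw_extent=16.0, triangles_per_cluster=64):
    """Split clusters by on-screen size and build indirect arguments for both paths.

    screen_aabbs: list of (min_x, min_y, max_x, max_y) in pixels.
    """
    software, hardware = [], []
    for cluster_id, (min_x, min_y, max_x, max_y) in enumerate(screen_aabbs):
        extent = max(max_x - min_x, max_y - min_y)
        # Small clusters go to the compute rasterizer, large ones to hardware.
        (software if extent <= max_sw_extent else hardware).append(cluster_id)
    sw_dispatch_args = (len(software), 1, 1)  # one workgroup per cluster
    hw_draw_args = (triangles_per_cluster * 3, len(hardware), 0, 0)
    return software, hardware, sw_dispatch_args, hw_draw_args

screen_aabbs = [
    (10, 10, 12, 11),      # tiny: software
    (0, 0, 300, 200),      # large: hardware
    (50, 50, 60, 58),      # small: software
    (100, 100, 140, 101),  # thin but long: hardware
]
software, hardware, sw_dispatch_args, hw_draw_args = classify_clusters(screen_aabbs)
```

Using the larger of width and height keeps long, thin clusters off the software path, where a per-triangle loop over a wide bounding box would be expensive.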

The Bevy implementation exposes practical challenges that Nanite's polished presentation glosses over. Meshlet sizes (64 triangles max) have to balance between wavefront utilization and overdraw. Compressed per-meshlet vertex data — quantized positions with per-meshlet bounding boxes — reduces memory by roughly half but adds decode cost. LOD selection heuristics need to account for both screen-space size and normal variation to avoid popping artifacts. None of these are solved problems; they're engineering tradeoffs that depend on your content and target hardware.2
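The memory-versus-decode tradeoff is easy to see in a small quantization sketch. It assumes a fixed 16 bits per axis relative to the meshlet's bounding box, which is a simplification chosen for the example and not necessarily the encoding Bevy uses; the function names and sample positions are invented.

```python
def quantize_meshlet(positions, bits=16):
    """Store each position as integers relative to the meshlet's bounding box."""
    levels = (1 << bits) - 1
    lo = [min(p[i] for p in positions) for i in range(3)]
    hi = [max(p[i] for p in positions) for i in range(3)]
    # Step size per axis; a flat axis gets a dummy step to avoid dividing by zero.
    step = [(hi[i] - lo[i]) / levels if hi[i] > lo[i] else 1.0 for i in range(3)]
    quantized = [tuple(round((p[i] - lo[i]) / step[i]) for i in range(3))
                 for p in positions]
    return lo, step, quantized

def dequantize_meshlet(lo, step, quantized):
    """The decode cost: one multiply-add per component, paid at fetch time."""
    return [tuple(lo[i] + q[i] * step[i] for i in range(3)) for q in quantized]

positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0), (0.5, 0.25, 3.0)]
lo, step, quantized = quantize_meshlet(positions)
restored = dequantize_meshlet(lo, step, quantized)

max_error = max(abs(a[i] - b[i]) for a, b in zip(positions, restored) for i in range(3))
raw_bytes = len(positions) * 3 * 4     # three float32 components per vertex
packed_bytes = len(positions) * 3 * 2  # three uint16 components per vertex
```

Ignoring the small per-meshlet bounding box, 16-bit components halve the position data, and the reconstruction error is bounded by half a quantization step on each axis.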

GTA V: An Earlier Generation

Adrian Courrèges' graphics study of GTA V shows GPU-driven techniques in a pre-Nanite engine. GTA V uses compute shaders for LOD selection and culling — deciding per-object whether to render and at what detail level. This was the step that differentiated the PC/PS4 rendering from the PS3 version, which lacked compute shader support and had to do this work on the Cell processor's SPUs.3

The game also demonstrates several GPU-driven optimizations that are easy to overlook: the shadow pipeline's "early out" texture that skips expensive blur computation for fully-lit pixels, front-to-back rendering order to maximize early-Z rejection, and alpha stippling for LOD transitions (skipping every other pixel in a checkerboard pattern to make opaque meshes appear semi-transparent during LOD crossfades). These are small things individually, but they represent the GPU-driven philosophy: make the GPU do more work per draw call so the CPU issues fewer of them.3
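The stippling trick reduces to a per-pixel keep-or-discard test. The sketch below assumes the two LODs in a crossfade use complementary checkerboards, so every pixel is written by exactly one of them; that pairing and the function name are assumptions made for the example rather than details taken from the study.

```python
def stipple_keep(x, y, incoming):
    """Return True if this pixel should be drawn for the given LOD.

    The outgoing LOD keeps the even checkerboard cells and the incoming LOD
    keeps the odd ones; a shader would discard the fragment otherwise.
    """
    on_even_cell = (x + y) % 2 == 0
    return on_even_cell != incoming

pixels = [(x, y) for y in range(4) for x in range(4)]
outgoing_pixels = [p for p in pixels if stipple_keep(p[0], p[1], incoming=False)]
incoming_pixels = [p for p in pixels if stipple_keep(p[0], p[1], incoming=True)]
```

Both meshes stay fully opaque, so they can be drawn in the normal depth-tested pass with no sorting or blending.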

Precomputed Lighting as the Opposite Extreme

Jonathan Blow's approach to precomputed lighting in The Witness sits at the other end of the complexity spectrum but shares the core motivation. Rather than build an elaborate real-time GI system, The Witness walks a virtual camera over every lightmap texel and renders the scene into a hemicube using the existing game renderer. It's using the GPU to do offline computation that would traditionally require separate ray tracing or radiosity code — the system "automatically matches" the real-time rendering because it is the real-time rendering, just used as a precomputation tool.4

This is a different flavor of GPU-driven: not driving rendering decisions at runtime, but using the GPU's rendering capabilities as a general computation engine. The distinction matters less than it might seem. Both are about recognizing that the GPU is the fastest machine in the system and giving it more autonomy.

Where It's Going

The trend is clear: with every engine generation, more per-frame decisions move from CPU to GPU. Culling, LOD, sorting, light assignment, material dispatch, shadow updates: all of these are becoming compute shader dispatches coordinated by indirect draw/dispatch buffers. The CPU's job is increasingly just to set up the scene description and press go.

The limiting factor isn't hardware capability but API and portability constraints. Advanced GPU-driven techniques rely on features like device-scope atomics, forward progress guarantees, and indirect dispatch that vary across Vulkan, DX12, Metal, and WebGPU. Raph Levien's experience with portable compute shaders suggests that the most aggressive techniques will remain Vulkan/DX12-only for some time, with fallbacks needed for Metal and the web.5

Footnotes

  1. A Macro View of Nanite by Eduardo Lopez Rodriguez — source

  2. Virtual Geometry in Bevy 0.15 by JMS55 — source

  3. GTA V - Graphics Study by Adrian Courrèges — source

  4. Graphics Tech: Precomputed Lighting by Jonathan Blow — source

  5. Prefix sum on portable compute shaders by Raph Levien — source
