Compute Shaders
Compute shaders exist because someone looked at a GPU full of thousands of math cores and thought: why should these only draw triangles? The hardware was already general-purpose — unified shader architectures had erased the distinction between vertex and pixel processing units back in 2006. Compute shaders just ripped away the graphics plumbing and exposed the raw machine.
How We Got Here
The hardware story matters because it explains why the programming model looks the way it does. Before unified shaders, GPUs had physically separate units for vertex and pixel work. A Voodoo2 had one rasterizer and two texturing units. The GeForce 7 had dedicated vertex and pixel shader cores. This worked when scenes had few polygons covering many pixels, but it made load balancing impossible — dense geometry starved the pixel units, complex pixel shaders starved the vertex units.1
The GeForce 8800 GTX (late 2006) and the Radeon HD 2900 (2007) unified everything. Gone were separate vertex and pixel units; the same cores could handle any type of work. The key leftover from the old architecture was a chunk of fast memory inside each compute unit that had been used to pass attributes between shader stages. This local memory, plus the SIMD execution model, became the foundation of the compute shader programming model.1
The Three-Level Hierarchy
Every GPU compute API — CUDA, OpenCL, Vulkan compute, DirectX compute, Metal compute, WebGPU compute — uses the same basic structure:1
- Work domain: all the work you want done
- Work groups: independent chunks that map to a single compute unit on the GPU
- Work items: individual threads within a work group
Work groups are the critical level. Within a group, threads can synchronize via barriers and share data through local memory (also called shared memory or LDS). Between groups, there are almost no guarantees — they may run in any order, on any compute unit, and cannot safely communicate except through global memory with careful atomic operations.
This maps directly to the hardware. An AMD GCN compute unit has four 16-wide SIMD units, 256 KiB of registers, and 64 KiB of local memory. A work group lands on one compute unit. The SIMD units chew through work items in bundles — 64 at a time on AMD (wavefronts), 32 on NVIDIA (warps). The programmer doesn't directly control SIMD assignment, but awareness of the warp/wavefront width matters enormously for performance.1
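As a rough illustration of how the three levels nest, the following Python sketch maps a flat work-item index onto a group, a position inside the group, and a warp/wavefront lane. The names `WorkItem`, `locate`, and `groups_needed` are hypothetical, and the contiguous lane assignment is an assumption made for illustration; real hardware is free to assign lanes differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    group_id: int  # which work group the item belongs to
    local_id: int  # index of the item inside its group
    wave_id: int   # warp/wavefront inside the group (assumed contiguous)
    lane: int      # SIMD lane inside that warp/wavefront


def locate(global_id: int, group_size: int, wave_width: int) -> WorkItem:
    """Split a flat index in the work domain into the nested coordinates."""
    group_id, local_id = divmod(global_id, group_size)
    wave_id, lane = divmod(local_id, wave_width)
    return WorkItem(group_id, local_id, wave_id, lane)


def groups_needed(domain_size: int, group_size: int) -> int:
    """Number of work groups to dispatch so every item is covered (ceiling division)."""
    return -(-domain_size // group_size)


# One million items in groups of 256, with 64-wide wavefronts.
print(locate(global_id=1000, group_size=256, wave_width=64))
print(groups_needed(1_000_000, 256))
```

Because the group count is rounded up, the last group usually contains items past the end of the domain, which is why compute kernels typically start with a bounds check on the global index.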
Latency Hiding: The Whole Point
GPUs are fast at math but terrible at memory access, relatively speaking. A GCN GPU with 64 compute units at 1.5 GHz could theoretically demand about 67 TiB/s of operand data, orders of magnitude more than any memory system can deliver. The solution: hide latency by having far more work in flight than can execute simultaneously. A single compute unit might have 40 wavefronts loaded, with only four actively executing. When one wavefront hits a memory stall, another is swapped in instantly. "Instantly" is literal: all the state (registers) stays resident, so there is no context-switch cost.1
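One way to arrive at a figure of about 67 TiB/s is the back-of-the-envelope calculation below. Apart from the 1.5 GHz clock, every input is an assumption chosen for illustration: a GPU with 64 compute units, 64 lanes per unit (four 16-wide SIMDs), and one fused multiply-add per lane per cycle reading three 4-byte operands.

```python
# Assumed configuration; only the clock speed comes from the text.
COMPUTE_UNITS = 64       # assumption: size of the whole GPU
LANES_PER_CU = 64        # four 16-wide SIMD units per compute unit
OPERANDS_PER_OP = 3      # assumption: a fused multiply-add reads three inputs
BYTES_PER_OPERAND = 4    # 32-bit floats
CLOCK_HZ = 1.5e9

bytes_per_cycle = COMPUTE_UNITS * LANES_PER_CU * OPERANDS_PER_OP * BYTES_PER_OPERAND
demand_bytes_per_s = bytes_per_cycle * CLOCK_HZ
demand_tib_per_s = demand_bytes_per_s / 2**40

print(f"{bytes_per_cycle} bytes per cycle")
print(f"{demand_tib_per_s:.2f} TiB/s of operand traffic if nothing came from registers")
```

Under these assumptions the number describes the whole chip rather than a single compute unit, and it only makes sense as an upper bound: in practice most operands come from registers, which is exactly why register capacity matters so much.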
This has a direct consequence for how you write code. More state per thread means fewer threads fit in the register file at once, which means fewer wavefronts are available for latency hiding. This is "occupancy": the fraction of the maximum concurrent work that the hardware can actually keep resident. Memory-intensive code should minimize register use to maximize occupancy. Pure math code needs far less occupancy because it rarely stalls on memory.1
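The register/occupancy trade-off can be sketched with the GCN numbers quoted earlier (256 KiB of registers per compute unit, four SIMDs, 64-wide wavefronts, 40 resident wavefronts). This is a simplified model with a hypothetical `waves_per_simd` helper: it ignores register allocation granularity and the other limits, such as local memory and scalar registers, that a real occupancy calculator would include.

```python
SIMDS_PER_CU = 4
REGISTER_FILE_BYTES = 256 * 1024   # per compute unit
LANES_PER_WAVE = 64
BYTES_PER_REGISTER = 4             # one 32-bit value per lane
MAX_WAVES_PER_SIMD = 40 // SIMDS_PER_CU  # 40 resident wavefronts per compute unit

# Registers available to each lane of one SIMD.
REGISTERS_PER_LANE = (
    REGISTER_FILE_BYTES // SIMDS_PER_CU // LANES_PER_WAVE // BYTES_PER_REGISTER
)


def waves_per_simd(registers_per_thread: int) -> int:
    """Wavefronts that fit on one SIMD when each thread needs this many registers."""
    if registers_per_thread < 1:
        raise ValueError("a thread needs at least one register")
    return min(MAX_WAVES_PER_SIMD, REGISTERS_PER_LANE // registers_per_thread)


def occupancy(registers_per_thread: int) -> float:
    """Fraction of the maximum resident wavefronts that can actually be loaded."""
    return waves_per_simd(registers_per_thread) / MAX_WAVES_PER_SIMD


for regs in (24, 32, 64, 128):
    print(regs, waves_per_simd(regs), occupancy(regs))
```

In this model a shader using 24 registers per thread reaches full occupancy, while one using 128 leaves only two wavefronts per SIMD to cover memory stalls.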
Branching: Not as Bad as They Say
The internet wisdom that GPU branching is catastrophic is outdated. The reality is nuanced. If you're branching on a constant buffer value, the branch is nearly free — the compiler knows all threads will take the same path. If you're branching on texture data, the compiler will either execute both sides (for short branches) or do a real branch. The danger is divergence — threads in the same wavefront taking different paths, forcing both to execute.2
Jason Booth's practical framework for working with branches is worth noting: visualize them. His MicroSplat terrain shader includes debug macros that count how many branches each pixel takes and how many texture samples result. With branches enabled on a complex triplanar + stochastic shader, sample counts drop from 100 per pixel to 9–72, saving two-thirds of bandwidth. The key is branching on spatially coherent data — nearby pixels tend to share the same terrain blend weights, so the divergent region is small relative to the total.2
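Why coherence matters can be shown with a toy cost model. It is a sketch with made-up costs, not a measurement: a wavefront pays for a side of the branch if at least one of its threads takes that side, so a wavefront whose threads disagree pays for both.

```python
def wave_cost(takes_branch, cost_taken, cost_skipped):
    """Cost of one wavefront: each side is paid for if any thread needs it."""
    cost = 0
    if any(takes_branch):
        cost += cost_taken
    if not all(takes_branch):
        cost += cost_skipped
    return cost


def dispatch_cost(conditions, wave_width, cost_taken, cost_skipped):
    """Total cost over all wavefronts, slicing threads into wave-sized bundles."""
    return sum(
        wave_cost(conditions[start:start + wave_width], cost_taken, cost_skipped)
        for start in range(0, len(conditions), wave_width)
    )


THREADS, WAVE = 256, 64
EXPENSIVE, CHEAP = 10, 1  # hypothetical costs: texture-heavy path vs. early out

coherent = [i < 128 for i in range(THREADS)]      # neighbours agree
scattered = [i % 2 == 0 for i in range(THREADS)]  # neighbours disagree
unbranched = [True] * THREADS                     # no branch: always do the work

print(dispatch_cost(coherent, WAVE, EXPENSIVE, CHEAP))
print(dispatch_cost(scattered, WAVE, EXPENSIVE, CHEAP))
print(dispatch_cost(unbranched, WAVE, EXPENSIVE, CHEAP))
```

In both branching cases exactly half the threads take the expensive path, yet the coherent layout costs about half of the unbranched version while the scattered layout costs slightly more than not branching at all.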
The real gotcha with texture sampling inside branches: you need gradient-aware samplers. The GPU computes mipmap levels from the rate of change of UV coordinates across a 2x2 pixel quad. If some pixels in the quad are inside a branch and others aren't, the derivatives become garbage. The fix is to compute derivatives (ddx/ddy) before the branch and pass them to the sampler explicitly. Fail to do this and you'll get either compilation errors, both sides executing, or silently broken LOD — depending on the platform.2 3
Beyond Graphics: Prefix Sums and the Portability Problem
Raph Levien's work on prefix sums for piet-gpu (a GPU-accelerated 2D rendering engine) exposes the gap between what compute shaders can theoretically do and what's portable in practice. The "decoupled look-back" algorithm for prefix sums requires workgroups in the same dispatch to coordinate via atomics — one workgroup publishes its partial result, the next one reads it. This needs device-scope atomic barriers. Vulkan and DX12 can handle it. Metal can't — its threadgroup_barrier with mem_device flag has threadgroup scope, not device scope. And if Metal can't do it, WebGPU can't require it.4
The result: the fastest algorithm simply cannot run on a significant fraction of GPUs. Levien's portable fallback uses tree reduction instead of decoupled look-back — about 1.5x slower, but it works everywhere. The broader lesson is that compute shader portability is a minefield of subtle capability differences between APIs, even when all of them notionally support compute.4
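The general shape of a portable multi-pass prefix sum can be modelled on the CPU as below. This is a sketch of the common reduce-then-scan structure, not Levien's implementation: `blocked_inclusive_scan` is a hypothetical name, each block stands in for one workgroup, and Python's `sum` and `accumulate` stand in for the in-workgroup tree reduction and scan. The point is that no block ever reads another block's result within the same pass, so no device-scope coordination is needed.

```python
from itertools import accumulate


def blocked_inclusive_scan(values, block_size):
    """Inclusive prefix sum built from three independent passes."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]

    # Pass 1 (one dispatch): every block reduces its own elements.
    block_sums = [sum(block) for block in blocks]

    # Pass 2 (small dispatch): exclusive scan over the per-block sums.
    offsets = [0] + list(accumulate(block_sums))[:-1]

    # Pass 3 (one dispatch): every block scans locally and adds its offset.
    result = []
    for block, offset in zip(blocks, offsets):
        result.extend(offset + partial for partial in accumulate(block))
    return result


data = list(range(1, 11))
print(blocked_inclusive_scan(data, block_size=4))
```

The cost relative to decoupled look-back comes from the extra passes: the input is read more than once and the dispatches are separated by synchronization points.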
Forward progress is another landmine. Advanced compute patterns (like the coordination between workgroups Nanite uses for GPU-driven rendering) assume that if a workgroup is waiting for data from another, the hardware will eventually schedule the producer. No GPU spec guarantees this unconditionally. ARM and Apple GPUs have been observed to violate it, so a forward-progress guarantee cannot go into the Vulkan core spec; at most it can be exposed as an optional property.4
The Execution Pattern Matters
Michal Drobot's investigation of GCN rasterization patterns revealed that how you dispatch work affects cache performance by 10% or more. A full-screen pixel shader drawn as a two-triangle quad processes the triangles sequentially — the first triangle top-to-bottom, then the second — creating a cache-killing discontinuity where the second triangle starts. A single full-screen triangle, or a compute shader dispatch, maintains spatial locality throughout. On GCN, the numbers were stark: 87% L1/L2 cache hit rate for the two-triangle quad, 95% for a single triangle or compute.5
The practical takeaway: if your post-processing pipeline is still drawing two-triangle quads for full-screen passes, switch to either a single oversized triangle or a compute shader. It's a free 8% performance improvement for cache-bound shaders. And it's one of those things that stacks — across a post-processing chain with five or six passes, you're talking real milliseconds.5
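For the single-triangle option, a common trick is to derive the three clip-space positions from the vertex index alone, with no vertex buffer. The Python model below mirrors the bit arithmetic usually written in a vertex shader; the function names are hypothetical, and the coverage check is only there to confirm that the oversized triangle contains the whole [-1, 1] screen square.

```python
def fullscreen_triangle_vertex(vertex_index):
    """Clip-space position for vertex 0, 1 or 2 of one oversized triangle."""
    u = (vertex_index << 1) & 2  # 0, 2, 0
    v = vertex_index & 2         # 0, 0, 2
    return (u * 2.0 - 1.0, v * 2.0 - 1.0)


def edge(a, b, p):
    """Signed area test: >= 0 when p lies on or to the left of the edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(triangle, point):
    a, b, c = triangle
    return all(edge(s, e, point) >= 0 for s, e in ((a, b), (b, c), (c, a)))


triangle = [fullscreen_triangle_vertex(i) for i in range(3)]
screen_corners = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]

print(triangle)
print(all(covers(triangle, corner) for corner in screen_corners))
```

The parts of the triangle outside the screen are clipped away, so every pixel is shaded exactly once and there is no diagonal seam.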
Compute Outside the Graphics Pipeline
Fabian Giesen's final entry in his graphics pipeline series focuses entirely on compute shaders, and makes an observation that helps explain why they feel so different from the rest of the pipeline: they have almost no built-in I/O. A vertex shader receives vertex data. A pixel shader receives interpolated attributes and writes to render targets. A compute shader receives one thing: its thread index. Everything else — what data to read, what data to write, where, and in what order — is the programmer's problem.6
This is both the difficulty and the power. The thread group is the critical organizational unit, and what makes it special is Thread Group Shared Memory (TGSM) — 32 KiB of scratchpad on DX11 hardware, physically located on the shader unit itself. Because all threads in a group execute on the same unit, TGSM access requires no inter-unit coherence protocol. Only one warp can issue instructions per cycle, so there's a natural serialization of access. Barriers — group synchronization, group memory barriers, device memory barriers — provide the additional ordering guarantees when you need them, at different costs. Group sync and shared-memory barriers are cheap (pipeline flush within one unit). Device memory barriers are catastrophically expensive — they block until all outstanding memory transactions complete, easily a thousand cycles.6
Unordered Access Views (UAVs) give compute shaders their write-anywhere capability, but the "unordered" part is load-bearing. Unlike render targets, where the hardware guarantees writes complete in API order, UAV writes happen as the shader encounters them — which could be in any order relative to other threads. Atomic operations on UAVs go through dedicated atomic units that talk directly to the lowest-level cache, bypassing the per-unit cache hierarchy. Each atomic unit "owns" a hash range of addresses, ensuring no two units try to atomically modify the same location simultaneously. This is radically different from CPU coherence (MESI protocol, cache-line ownership) and explains why GPU atomics have different performance characteristics — no false sharing, but also no cache-line-level optimizations.6
The Mobile Cache Problem
Raph Levien's debugging of a curiously slow shader on Adreno 640 hardware revealed a subtlety in compute shader memory models that has no easy parallel in graphics shaders.7 His piet-gpu 2D renderer ran a fine rasterization pass that took about 2 ms on desktop Intel integrated graphics. On a Pixel 4, the same workload took 11.7 ms, far worse than the hardware difference would predict.
The culprit was the shader compiler's treatment of memory coherence. The fine rasterizer reads path data and writes clip stack data to the same GPU buffer (at different, non-overlapping offsets). Because the shader both reads and writes the same buffer, the Adreno compiler conservatively bypassed the texture cache for all reads from that buffer, using ldib (load image buffer, straight to memory) instead of isam (image sample, through the TPL1 texture cache). This meant every path data read paid full memory latency rather than hitting L1 cache.7
The fix was straightforward — separate the read-only path data from the read-write clip stack into different buffers — but the diagnosis required reading shader assembly via Mesa's Freedreno disassembler and consulting with the Freedreno author. The deeper lesson connects to the Vulkan memory model: without explicit memory ordering annotations, compilers must be conservative about cache coherence. A shader that only needs relaxed semantics (each write consumed only by the same thread) pays the cost of device-scope coherence because the compiler can't know better. This is the kind of performance cliff that makes compute shader portability genuinely hard — the same SPIR-V runs fine on desktop and collapses on mobile, not because the hardware is slow but because the compiler makes different assumptions.7
WebGPU: Compute for the Web
Surma's WebGPU tutorial makes a compelling case that WebGPU's compute pipeline is actually easier to understand than its rendering pipeline — and that for non-graphics workloads, you can ignore rendering entirely.8 WebGPU compute inherits the same workgroup model as every other compute API: you define workgroup size in the shader (via @workgroup_size(64)) and dispatch workgroups from the host. The workload is modeled as a 3D grid of work items grouped into workgroups, and the shader runs once per work item.
What WebGPU adds to the picture is a clean abstraction over the adapter/device hierarchy that's genuinely useful pedagogically. The physical GPU is multiplexed by the driver to look exclusive to each application. The browser adds another layer of multiplexing so each web app feels like it has sole control. This nesting — physical device → driver → adapter → logical device — makes explicit what Vulkan leaves implicit. The default device returned by requestDevice() intentionally caps capabilities below the physical hardware (1 GiB buffers even if the GPU handles 4 GiB), ensuring that code tested on high-end hardware still runs on the median device.8
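Capped limits also shape how work is dispatched. The sketch below plans a dispatch for a large 1D workload under two limits, 65535 workgroups per dimension and 256 invocations per workgroup. Both values are assumptions about the default device and should be checked against the WebGPU specification or the `limits` object of the actual device; `plan_dispatch` and `linear_group_index` are hypothetical helpers.

```python
# Assumed default limits; verify against the device's reported limits.
MAX_WORKGROUPS_PER_DIMENSION = 65535
MAX_INVOCATIONS_PER_WORKGROUP = 256


def ceil_div(a, b):
    return -(-a // b)


def plan_dispatch(item_count, workgroup_size=64):
    """Return (x, y, z) workgroup counts covering item_count work items."""
    if not 1 <= workgroup_size <= MAX_INVOCATIONS_PER_WORKGROUP:
        raise ValueError("workgroup size outside the assumed limit")
    groups = ceil_div(item_count, workgroup_size)
    if groups <= MAX_WORKGROUPS_PER_DIMENSION:
        return (groups, 1, 1)
    # Too many groups for one dimension: fold the excess into y.
    y = ceil_div(groups, MAX_WORKGROUPS_PER_DIMENSION)
    if y > MAX_WORKGROUPS_PER_DIMENSION:
        raise ValueError("workload too large for a 2D dispatch")
    x = ceil_div(groups, y)
    return (x, y, 1)


def linear_group_index(group_x, group_y, groups_x):
    """What the shader would compute to recover a flat workgroup index."""
    return group_y * groups_x + group_x


print(plan_dispatch(1000))
print(plan_dispatch(10_000_000))
```

When the work is folded into two dimensions, the grid can be slightly larger than needed, so the shader has to rebuild the flat index and skip any item past the end of the data.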
WGSL, the WebGPU shading language, compiles to whatever the underlying platform needs — HLSL for DirectX, MSL for Metal, SPIR-V for Vulkan. The language itself is deliberately conservative: no recursion, explicit memory annotations, Rust-like syntax with GLSL's math builtins. The @workgroup_size attribute and bind group layout make the programmer spell out everything that other APIs let you assume, which adds boilerplate but makes the execution model harder to get wrong.8
Footnotes
1. Introduction to compute shaders by Anteru
3. The weird world of shader divergence and LOD by Themaister
4. Prefix sum on portable compute shaders by Raph Levien
5. GCN Execution Patterns in Full Screen Passes by Michal Drobot
6. A trip through the Graphics Pipeline 2011, part 13 by Fabian Giesen
7. The case of the curiously slow shader by Raph Levien
8. WebGPU — All of the cores, none of the canvas by Surma