Shader Programming

A GPU is not a fast CPU. It's a different kind of computer — thousands of simple cores that execute the same instruction simultaneously across massive swaths of data. Understanding this architecture is the price of admission for graphics programming, and increasingly for machine learning. The shading languages that target these machines are superficially C-like but encode fundamentally different assumptions about parallelism, memory, and control flow. And the newest development — automatic differentiation baked into the language itself — is collapsing the wall between rendering and learning.

The Execution Model

GPUs are built on Single-Instruction-Multiple-Data (SIMD) hardware. On AMD hardware, a kernel executes across groups of 32 or 64 work-items called wavefronts; NVIDIA calls its equivalent, which is 32 threads wide, a warp. Every thread in a wavefront executes the same instruction at the same time. When threads diverge (one takes the if branch, another takes the else), both branches execute for the entire wavefront, with inactive threads masked off.¹ This means that in a ray marcher, a wavefront containing some rays that hit glass and others that hit diffuse surfaces steps through both material shaders, even though each ray needs only one.

This is the SIMT (Single Instruction Multiple Threads) model, and it's why GPU programming rewards uniformity. The best GPU code has all threads in a wavefront doing the same thing. The worst has every thread doing something different. Path tracing is naturally divergent — every bounce can send rays to different materials — which is why real-time path tracers are hard even on hardware with enormous theoretical throughput.
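The cost of divergence can be sketched on the CPU. The following is a toy model, not how any real GPU scheduler is implemented: the function name run_wavefront and the "glass"/"diffuse" labels are illustrative assumptions. It counts how many branch bodies a whole wavefront must step through when lanes are masked on and off.

```python
def run_wavefront(hits_glass):
    """Simulate one wavefront executing an if/else in lockstep.

    hits_glass holds one boolean per lane (thread). Returns (results, passes),
    where passes counts the branch bodies the whole wavefront stepped through.
    """
    lanes = len(hits_glass)
    results = [None] * lanes
    passes = 0

    # The "if" side runs if at least one lane needs it.
    if_mask = list(hits_glass)
    if any(if_mask):
        passes += 1
        for lane in range(lanes):
            if if_mask[lane]:  # lanes outside the mask are skipped (masked off)
                results[lane] = "glass"

    # The "else" side runs if at least one lane needs it.
    else_mask = [not m for m in hits_glass]
    if any(else_mask):
        passes += 1
        for lane in range(lanes):
            if else_mask[lane]:
                results[lane] = "diffuse"

    return results, passes


uniform = [True] * 32
divergent = [lane % 2 == 0 for lane in range(32)]
_, uniform_passes = run_wavefront(uniform)
_, divergent_passes = run_wavefront(divergent)
print(uniform_passes, divergent_passes)  # prints: 1 2
```

A single divergent lane is enough to make the whole wavefront pay for both sides, which is why sorting or binning work by material is a common mitigation.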

The Language Fragmentation

The shader language ecosystem is fragmented by design, because there are multiple competing GPU vendors and graphics APIs.¹ The big four:

HLSL (High Level Shading Language) — Microsoft's language for DirectX. Arguably the most fully-featured: namespaces, template generics, overloadable operators, #include. Separates samplers from textures, which GLSL doesn't.
GLSL (OpenGL Shading Language) — Khronos's language for Vulkan and OpenGL. Uses vec4 instead of float4, couples samplers with textures, but is otherwise similar. The WebGL variants are more limited: WebGL 1.0 uses GLSL ES 1.00, which is significantly more restricted, and WebGL 2.0 uses GLSL ES 3.00.
MSL (Metal Shading Language) — Apple's language for Metal. C++14-based, only targets Apple hardware.
WGSL (WebGPU Shading Language) — The newest entrant, designed for the WebGPU standard. Safer by design, with explicit memory annotations.

Despite the fragmentation, the languages are similar enough that translating between them is mostly mechanical: rename frac to fract, swap the operand order of matrix multiplications, adjust the binding syntax. The real complexity is in the intermediate representations, SPIR-V (Khronos) and DXIL (DirectX 12), and in the vendor-specific ISAs that drivers compile them to. SPIR-V has become the de facto hub: tools like dxc, glslang, SPIRV-Cross, and naga can route shaders between almost any pair of languages via SPIR-V.
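To give a feel for the "mostly mechanical" part, here is a minimal sketch of the renaming step only. The table is deliberately incomplete and the function name hlsl_to_glsl_names is an assumption made for illustration; real translators such as the tools named above work on parsed programs, not on text, and also handle the multiplication order and binding syntax that this sketch ignores.

```python
import re

# A few HLSL spellings and their GLSL counterparts. Deliberately incomplete.
HLSL_TO_GLSL = {
    "frac": "fract",
    "lerp": "mix",
    "float2": "vec2",
    "float3": "vec3",
    "float4": "vec4",
    "float4x4": "mat4",
}

# Match whole identifiers only, trying longer names first.
_names = sorted(HLSL_TO_GLSL, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(_names) + r")\b")


def hlsl_to_glsl_names(source):
    """Rename known HLSL identifiers to GLSL ones; everything else is untouched."""
    return _pattern.sub(lambda m: HLSL_TO_GLSL[m.group(1)], source)


print(hlsl_to_glsl_names("float4 c = lerp(a, b, frac(t));"))
# prints: vec4 c = mix(a, b, fract(t));
```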

What all these languages share is more important than what divides them: a standard library of vector math intrinsics (dot products, matrix multiplications, transcendentals), threadgroup synchronisation primitives, and a mental model where each invocation of your program handles one vertex, one pixel, or one compute work-item.

Shadertoy as Laboratory

Shadertoy deserves mention not because it's a language but because it transformed how people learn and experiment with shaders. A Shadertoy shader is a fragment shader with minimal inputs — pixel coordinate, time, mouse position — and a single output colour. From this constraint, people have built playable Doom levels, photorealistic path tracers, and visualisations of fluid dynamics.² The site was founded by demosceners, and the community includes graphics researchers, game developers, and hobbyists. It's the REPL for GPU programming — instant feedback, zero setup, global sharing.

Shadertoy's real contribution is lowering the barrier. You don't need to set up a Vulkan pipeline, manage swap chains, or handle resource binding. You write a function from pixel coordinate to colour, and the GPU runs it. This directness is why so many SDF techniques were developed and refined there — the overhead of traditional graphics APIs would have killed the experimental momentum.
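The "function from pixel coordinate to colour" idea is small enough to emulate on the CPU. This sketch is an assumption-laden stand-in, not Shadertoy's API: main_image and render are hypothetical names, and a plain Python loop plays the role of the GPU. The scene is a single circle described by its signed distance.

```python
import math


def main_image(frag_x, frag_y, width, height, time):
    """Map one pixel coordinate to an (r, g, b) colour."""
    # Centre the coordinates and normalise by height so the circle stays round.
    u = (frag_x + 0.5 - 0.5 * width) / height
    v = (frag_y + 0.5 - 0.5 * height) / height
    radius = 0.3 + 0.05 * math.sin(time)
    d = math.hypot(u, v) - radius  # signed distance: negative inside the circle
    inside = 1.0 if d < 0.0 else 0.0
    return (inside, 0.5 * inside, 0.2)


def render(width, height, time):
    """Run main_image once per pixel, as the GPU would do in parallel."""
    return [[main_image(x, y, width, height, time) for x in range(width)]
            for y in range(height)]


image = render(16, 8, 0.0)
for row in image:
    print("".join("#" if r > 0.5 else "." for r, _, _ in row))
```

Every pixel is computed independently from the same inputs, which is exactly the property that lets a GPU run the function across the whole screen at once.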

The Differentiable Turn

The most significant development in shader programming in the last five years isn't a new language feature or a faster GPU. It's the realisation that rendering pipelines can be made differentiable — that you can propagate gradients backward through the entire light transport simulation.

NVIDIA's Slang language embeds automatic differentiation as a first-class citizen: in the type system, the IR, the optimisation passes, and the IDE tooling.³ Applying bwd_diff to a function yields another function that computes the backward derivative. The type system tracks differentiability, catching common mistakes like accidentally calling non-differentiable functions from differentiable contexts. Higher-order differentiation — differentiating a derivative — is supported and enables advanced algorithms like warped-area sampling.

The practical impact is that existing real-time renderers can be made differentiable without rewriting them. The Falcor research framework's path tracer was made differentiable by reusing over 5,000 lines of pre-existing Slang shader code.³ This means you can optimise scene parameters (geometry, materials, lighting) to match a target image — inverse rendering — or train small neural networks inline within the rendering pipeline, like neural radiance caches that accelerate global illumination.
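Inverse rendering in miniature looks like the sketch below. It is plain Python, not Slang: the backward function shade_bwd is written by hand here, whereas a differentiable shading language would derive it from shade automatically. The one-pixel Lambertian-style model, the function names, and the learning rate are all assumptions chosen to keep the example tiny.

```python
def shade(albedo, n_dot_l):
    """Forward pass: Lambertian-style shading for a single pixel."""
    return albedo * max(n_dot_l, 0.0)


def shade_bwd(albedo, n_dot_l, d_out):
    """Backward pass: gradient with respect to albedo, scaled by d_out.

    Hand-derived for this sketch; an autodiff system would generate it.
    """
    return d_out * max(n_dot_l, 0.0)


def fit_albedo(target, n_dot_l, steps=200, lr=0.5):
    """Gradient descent on albedo so the rendered pixel matches the target."""
    albedo = 0.1
    for _ in range(steps):
        rendered = shade(albedo, n_dot_l)
        d_loss = 2.0 * (rendered - target)  # derivative of the squared error
        albedo -= lr * shade_bwd(albedo, n_dot_l, d_loss)
    return albedo


n_dot_l = 0.8
target = shade(0.6, n_dot_l)  # "photograph" produced with a known albedo
recovered = fit_albedo(target, n_dot_l)
print(round(recovered, 4))  # prints: 0.6
```

The same loop scales conceptually to geometry, materials, and lighting; what changes is that the backward pass runs through an entire path tracer instead of one multiplication.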

SIMT vs. Tensors

Slang's designers articulate a distinction that's crucial for understanding why shading languages and ML frameworks serve different needs.³ PyTorch and NumPy operate on whole tensors — a reduce-sum takes one line, a large matrix multiply is a single operation. This model is perfect for feed-forward neural networks where operations are uniform. But it's terrible for path tracing, where each ray hits a different surface and executes different logic.

Shading languages occupy the other end: the SIMT model specifies a program for a single element. Control flow divergence is natural — each ray can branch differently. A variable-step ray marcher is elegant in SIMT but devolves into complex active-mask-tracking code in a tensor framework.
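The contrast can be sketched with a toy one-dimensional marcher. Both functions below are illustrative assumptions: march_one is the single-element program a shading language would express, and march_batch imitates the whole-batch style with Python lists standing in for tensors and an explicit active mask. The scene (a wall at position 5) and the half-distance step are arbitrary choices that make rays finish after different numbers of steps.

```python
def distance_to_wall(position, wall=5.0):
    """Distance from a point on the ray to a wall (1D toy scene)."""
    return wall - position


def march_one(origin, eps=1e-3, max_steps=64):
    """SIMT style: the program for one ray, with an ordinary early exit."""
    t = 0.0
    for _ in range(max_steps):
        d = distance_to_wall(origin + t)
        if d < eps:
            break
        t += 0.5 * d  # conservative variable-size step
    return t


def march_batch(origins, eps=1e-3, max_steps=64):
    """Batch style: every ray advances together, so a mask tracks who is done."""
    n = len(origins)
    t = [0.0] * n
    active = [True] * n
    for _ in range(max_steps):
        if not any(active):
            break
        d = [distance_to_wall(origins[i] + t[i]) for i in range(n)]
        # Retire finished rays, then advance only the ones still marching.
        active = [active[i] and d[i] >= eps for i in range(n)]
        t = [t[i] + 0.5 * d[i] if active[i] else t[i] for i in range(n)]
    return t


origins = [0.0, 2.5, 4.9]
print(march_batch(origins) == [march_one(o) for o in origins])  # prints: True
```

The two produce the same answers, but the batch version has to carry the mask through every line, and the bookkeeping grows with each additional branch.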

These models are complementary, not competing. The challenge is letting them interoperate. Slang can emit code for HLSL, GLSL/SPIR-V, CUDA/OptiX, and scalar C++. You can train with PyTorch optimisers and deploy on Vulkan without rewriting the shading code. A single representation in one language for both training and inference is the goal — and it's becoming practical.

Bindless and the Uber Shader Philosophy

There's a countercurrent to the explosion of shader permutations that engines like Unity and Unreal generate. id Software's approach in DOOM Eternal is radically different: the entire game runs on roughly 500 pipeline states and a handful of massive uber shaders.⁴ Instead of generating unique shaders for every combination of material features, they combine everything into a few monolithic shaders with runtime branching. This sounds like it should be slow — divergence in a warp is expensive — but id makes it work through two key architectural decisions.

First, fully bindless resources. Every texture in the scene is bound at once in a large descriptor table, indexed dynamically by material parameters passed through uniforms. No texture binding changes between draw calls. Second, all geometry lives in a single unified vertex buffer, with each mesh just an offset into it. A compute shader can merge draw calls from unrelated meshes into a single indirect draw, dramatically reducing CPU submission overhead. The DOOM Eternal graphics study reveals that even screen-space reflections run inside the forward uber shader rather than in a separate pass — trading register pressure for memory bandwidth savings.⁴
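The two data layouts can be pictured with ordinary containers. This is a CPU-side sketch under stated assumptions, not id Software's code: the texture and mesh names, the dictionary fields, and build_indirect_draws are all hypothetical, and in the engine described above the merging is done by a compute shader rather than by Python.

```python
# One table holds every texture; materials carry an index, not a binding.
texture_table = ["rock_albedo", "metal_albedo", "skin_albedo"]  # hypothetical names

# One shared vertex buffer; each mesh is just a range inside it.
vertex_buffer = list(range(30))  # stand-in for 30 vertices
meshes = {
    "crate": {"first_vertex": 0, "vertex_count": 12, "texture_index": 0},
    "barrel": {"first_vertex": 12, "vertex_count": 18, "texture_index": 1},
}


def build_indirect_draws(visible):
    """Pack draws for unrelated meshes into one indirect argument list.

    Each entry is (vertex_count, first_vertex, texture_index).
    """
    return [
        (meshes[name]["vertex_count"],
         meshes[name]["first_vertex"],
         meshes[name]["texture_index"])
        for name in visible
    ]


def sample(texture_index):
    """Dynamic indexing into the table replaces a per-draw texture bind."""
    return texture_table[texture_index]


draws = build_indirect_draws(["crate", "barrel"])
print(draws)  # prints: [(12, 0, 0), (18, 12, 1)]
```

Because nothing is rebound between entries, the whole list can be submitted as one draw, which is where the CPU savings come from.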

This is philosophically opposite to the "small composable shaders" approach, and it works because id controls the entire art pipeline. When you can dictate that all materials will fit one shading model, you avoid the combinatorial explosion. The tradeoff is artist flexibility. Most studios can't make that trade. But the performance results speak for themselves: DOOM Eternal runs at 60+ fps on modest hardware while looking like it has no business running that fast. Understanding GPU pipeline architecture — why draw call overhead matters, why divergence is the enemy — makes it clear why this approach works.

SPIR-V: The Intermediate That Isn't

SPIR-V was supposed to be the LLVM of shading languages — a portable binary intermediate representation that would let you write in any source language and target any GPU. The reality is more complicated. Dzmitry Malyshau (kvark), who built the Naga shader translation library for wgpu, documented a litany of design decisions in SPIR-V that make it actively hostile to anyone trying to use it as a genuine portable format rather than as a driver compiler input.⁵

The control flow representation is the deepest problem. SPIR-V represents control flow as a graph with merge and continue annotations that are supposed to preserve the structure of the source program. But the merge block — where two branch paths rejoin — doesn't necessarily mean both paths actually converge there. Naga's contributor eventually concluded "you can't really rely on the merge block as a point where two paths actually merge." Reconstructing structured control flow from SPIR-V's graph (which Naga needs to emit WGSL, HLSL, or MSL) turned into a research problem full of "sacred knowledge and edge cases."⁵

Other issues compound this: types must be globally unique (you can't have two int32 types with different names), OpFMod doesn't match any standard fmod definition, struct layouts are "open-ended" with no way to determine total size, and storage classes have confusing names (UniformConstant vs Uniform vs StorageBuffer). The spec is fragmented across three documents (the SPIR-V format, the Vulkan environment specification, and the Vulkan spec itself), making it hard to even find authoritative answers about edge-case behaviour.
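The remainder complaint is easier to appreciate once the two common conventions are side by side. As I read the SPIR-V specification, OpFMod takes the sign of the divisor while OpFRem takes the sign of the dividend, and C's fmod follows the dividend; treat that mapping as my reading rather than something the cited article spells out. The helper names below are assumptions for illustration.

```python
import math


def rem_truncated(x, y):
    """Remainder that takes the sign of the dividend, as C's fmod does."""
    return math.fmod(x, y)


def mod_floored(x, y):
    """Remainder that takes the sign of the divisor: x - y * floor(x / y)."""
    return x - y * math.floor(x / y)


# The two agree for positive operands and differ once a sign is involved.
print(rem_truncated(5.5, 2.0), mod_floored(5.5, 2.0))    # prints: 1.5 1.5
print(rem_truncated(-5.5, 2.0), mod_floored(-5.5, 2.0))  # prints: -1.5 0.5
```

A translator that maps one language's remainder onto the wrong convention produces shaders that look right until a negative coordinate shows up.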

Malyshau's conclusion is sharp: SPIR-V is a good format for what it was made for (driver compilers) but a poor format for intermediate portable representation of shaders. The WebGPU group's decision to create WGSL rather than adopt SPIR-V directly — once controversial — looks increasingly vindicated by the experience of everyone who has tried to build robust tooling around SPIR-V as a userland format.⁵

Gaussian Splatting and the New Primitives

The most surprising development in real-time rendering may be Gaussian splatting — representing scenes not as triangles or implicit surfaces but as clouds of 3D Gaussian ellipsoids, each with a position, covariance matrix, opacity, and view-dependent colour encoded in spherical harmonics.⁶ It emerged from the neural radiance field community but shed the neural network entirely: the "Gaussians" are just data, optimized through gradient descent but rendered through classical rasterization.

The rendering is tile-based: project each Gaussian to screen space, bin it into tiles, sort by depth within each tile, then blend front-to-back until opacity saturates. Tellusim's implementation achieves nearly twice the framerate of the original CUDA reference by blending eight pixels per GPU thread instead of one — amortising the data loading overhead across neighbours and reducing warp divergence from the early termination condition. The result is real-time novel view synthesis at 4K resolution: 95 fps on a 3090 for a scene with 5.8 million Gaussians.⁶
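The blend step for a single pixel can be sketched as follows. This is a simplified illustration, not the reference implementation or Tellusim's: projection, tile binning, and the per-pixel Gaussian falloff are omitted, each splat is reduced to an assumed (depth, alpha, colour) tuple, and the termination threshold is an arbitrary choice.

```python
def blend_pixel(splats, min_transmittance=1e-3):
    """Blend splats front to back for one pixel.

    splats is a list of (depth, alpha, (r, g, b)) tuples.
    Returns (colour, number_of_splats_blended).
    """
    colour = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    blended = 0
    for _, alpha, rgb in sorted(splats, key=lambda s: s[0]):
        if transmittance < min_transmittance:
            break  # opacity has saturated, so the remaining splats are hidden
        weight = transmittance * alpha
        for c in range(3):
            colour[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
        blended += 1
    return tuple(colour), blended


splats = [
    (2.0, 0.5, (0.0, 1.0, 0.0)),
    (1.0, 0.5, (1.0, 0.0, 0.0)),
]
print(blend_pixel(splats))  # prints: ((0.5, 0.25, 0.0), 2)
```

The early exit is the source of the divergence mentioned above: neighbouring pixels saturate after different numbers of splats, so threads in the same warp finish at different times.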

What's interesting about splatting from a shading language perspective is how it bypasses the entire traditional pipeline. There are no vertices, no triangles, no rasterizer, no fragment shader in the conventional sense. It's pure compute: sort, project, blend. The signed distance functions community showed that you could render without meshes by marching through scalar fields. Gaussian splatting shows you can render without ray marching by projecting statistical primitives. The assumption that rendering = triangles → rasterizer → fragments is increasingly just one option among several.

What This Means

The convergence of rendering and machine learning isn't a marketing trend; it's a structural shift. Data-driven techniques are appearing everywhere: neural texture compression, learned upscaling and denoising (DLSS), neural radiance fields, Gaussian splatting. All of these need gradients flowing through graphics operations. Previously this required hand-derived derivatives, which are tedious, error-prone, and hostile to iteration. Differentiable shading languages automate this and bring 10x training speedups for small-network workloads by fusing forward and backward passes, avoiding the overhead of PyTorch's serialise-checkpoint-launch cycle.

The long-term implication is that the boundary between "renderer" and "neural network" is dissolving. A future renderer might use learned BRDFs, neural importance sampling, and optimised scene representations — all within the same language, the same compilation pipeline, the same type system. Path tracing gave us physically based images. Differentiable path tracing will let us work backward from images to physics.

1. A Review of Shader Languages by Alain Galvan
2. Casual Shadertoy Path Tracing 1 by demofox
3. Differentiable Slang: A Shading Language for Renderers That Learn by NVIDIA
4. DOOM Eternal — Graphics Study by Simon Coenen
5. Horrors of SPIR-V by Dzmitry Malyshau
6. Hello Splatting by Tellusim
