The Low-Rank Alignment Control Surface

Geometry → function — what alignment tuning's control surface looks like, and what it does.

Alignment / preference post-training does not rewrite a language model uniformly. It concentrates its weight update into a thin, low-rank surface on the attention projections. This research track studies that low-rank alignment control surface from two sides.

Geometry. The alignment update has a task-intrinsic stable rank — a low, roughly constant rank, flat across model width and stable from GPT-NeoX to Llama-style architectures. What the surface looks like.
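As a rough illustration of the quantity involved, the sketch below computes the stable rank of a weight update, assuming "stable rank" carries its standard linear-algebra meaning: squared Frobenius norm divided by squared spectral norm. The function name `stable_rank`, the variables `w_base` and `w_tuned`, and the random matrices are all hypothetical stand-ins for real attention-projection weights; this is not the track's measurement code.

```python
import numpy as np


def stable_rank(delta: np.ndarray) -> float:
    """Stable rank ||delta||_F^2 / ||delta||_2^2 of a 2-D weight update."""
    # Singular values are returned in descending order.
    sing = np.linalg.svd(delta, compute_uv=False)
    top = sing[0]
    if top == 0.0:
        return 0.0  # convention chosen here for an all-zero update
    return float(np.sum(sing ** 2) / top ** 2)


# Hypothetical example: a base projection plus a small rank-4 update.
rng = np.random.default_rng(0)
d_model, r = 64, 4
w_base = rng.standard_normal((d_model, d_model))
update = rng.standard_normal((d_model, r)) @ rng.standard_normal((r, d_model))
w_tuned = w_base + 0.01 * update

# The stable rank of the difference never exceeds its algebraic rank (here r).
sr = stable_rank(w_tuned - w_base)
print(f"stable rank of update: {sr:.2f} (algebraic rank {r})")
```

Stable rank is invariant to rescaling the update and is bounded above by the algebraic rank, so under this definition it can be compared across models of different width.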

Function. In a controlled petri-dish setting, that surface implements a role-conditioned capability gate: a mechanism that, conditioned on the active role, suppresses the actions the role may not perform. The gate is installable, low-rank, depth-diffuse, and reaches free-generation behavior, but it does not compositionally generalize to roles defined by unseen permission combinations. What the surface does.
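To make the intended input-output behavior of such a gate concrete, here is a toy sketch in which a role's permission vector suppresses the logits of disallowed actions before a softmax. Everything in it is an assumption made for illustration: the action names, the role table, the `gated_probs` helper, and the hard penalty. The gate described above is a learned low-rank mechanism inside the model, not an explicit mask like this one.

```python
import math

# Hypothetical action vocabulary and role -> permission vector table.
ACTIONS = ["read", "write", "delete"]
ROLE_PERMISSIONS = {
    "viewer": [1, 0, 0],
    "editor": [1, 1, 0],
    "admin": [1, 1, 1],
}


def gated_probs(logits, role, penalty=30.0):
    """Softmax over action logits after penalising actions the role lacks."""
    perms = ROLE_PERMISSIONS[role]
    # Subtract a large constant from every disallowed action's logit.
    gated = [x - penalty * (1 - p) for x, p in zip(logits, perms)]
    # Numerically stable softmax.
    m = max(gated)
    exps = [math.exp(x - m) for x in gated]
    z = sum(exps)
    return [e / z for e in exps]


logits = [0.5, 1.0, 2.0]  # the ungated model prefers "delete"
viewer = gated_probs(logits, "viewer")
admin = gated_probs(logits, "admin")

print(dict(zip(ACTIONS, (round(p, 3) for p in viewer))))
print(dict(zip(ACTIONS, (round(p, 3) for p in admin))))
```

With the same logits, the hypothetical viewer role places essentially all probability on "read", while the admin role keeps the ungated preference for "delete".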

The track’s title-level thesis:

Contrastive post-training can install behavioral primitive role gates, but does not automatically induce a compositional binding layer from custom roles to canonical permission vectors.

The two halves

Papers on this track

This is one microsite for the whole track, not one per paper; it grows as the track adds papers (next: the canonical-policy-IR / binding-layer program).