The Low-Rank Alignment Control Surface

A technical overview of the geometry-to-function track: low-rank alignment updates, role gates, and the binding boundary.

This microsite tracks one research program: what small alignment updates are geometrically, and what behaviors they can actually control.

The motivating observation is that alignment / preference post-training does not appear to rewrite a model uniformly. In the studied runs, the useful update concentrates into a thin, low-rank surface on attention projections. The question is whether that surface is only a geometric curiosity, or whether it can be used as a reliable control mechanism for model behavior.

One-Screen Summary

| Layer | Question | Current answer |
| --- | --- | --- |
| Geometry | What does an alignment update look like? | A low, task-intrinsic stable-rank surface, flat across tested widths and replicated from GPT-NeoX-style to Qwen/Llama-style architectures. |
| Function | Can that surface implement a role-conditioned behavioral gate? | Yes: contrastive post-training can suppress forbidden primitive actions, and the suppression reaches free generation. |
| Boundary | Does the learned gate compose to unseen role/permission combinations? | No clean positive so far: the model tends to learn role-specific gates rather than reusable permission parts. |
| Open lever | What remains to test? | RPCG11c fixed primitive frequency but exposed an always-OPEN collapse. The remaining input-side lever is class-weighted CE: keep corpus/sampling unchanged, but remove the OPEN-class loss shortcut. |

The title-level thesis is:

Contrastive post-training can install behavioral primitive role gates, but does not automatically induce a compositional binding layer from custom roles to canonical permission vectors.

Terms

Low-rank alignment control surface. The part of weight space that the alignment update actually uses. It is measured by stable rank, not by raw parameter count.
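
As a minimal sketch of the measurement, assuming the standard definition of stable rank (squared Frobenius norm divided by squared spectral norm) applied to a weight delta: the function name `stable_rank`, the matrix size, and the synthetic updates below are illustrative assumptions, not the program's actual measurement code.

```python
import numpy as np

def stable_rank(delta_w: np.ndarray) -> float:
    """Stable rank ||A||_F^2 / ||A||_2^2 of a 2-D update matrix."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, largest first
    if s[0] == 0.0:
        return 0.0  # an all-zero update has no direction at all
    return float(np.sum(s ** 2) / s[0] ** 2)

rng = np.random.default_rng(0)
d = 256  # hypothetical projection width

# Hypothetical rank-2 update: stable rank can never exceed the true rank.
low_rank_update = rng.standard_normal((d, 2)) @ rng.standard_normal((2, d))
# Dense Gaussian update of the same shape, for contrast.
dense_update = rng.standard_normal((d, d))

print("low-rank update:", stable_rank(low_rank_update))
print("dense update:   ", stable_rank(dense_update))
```

Both matrices have the same parameter count, but their stable ranks differ by more than an order of magnitude, which is why the surface is described by stable rank rather than by size.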

Role-conditioned capability gate. A trained mechanism that says, for the active role, which primitive actions are open and which should be declined.

Compositional binding. The desired next step: a new role should work because the model maps it to a canonical permission vector, not because that exact role was memorized during training.
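
The difference between a memorized gate and a compositional one can be shown symbolically. The sketch below is a toy, not a description of what happens inside the model: the role names, the `bind` function, and the dictionary-based role spec are all hypothetical. Only the OBEY / USE / QUOTE primitives come from the experiments described on this page.

```python
from typing import Dict, Tuple

PRIMITIVES = ("OBEY", "USE", "QUOTE")
PermissionVector = Tuple[bool, ...]  # one bit per primitive, in PRIMITIVES order

def gate(permissions: PermissionVector, primitive: str) -> str:
    """Shared gate: reads one bit of a canonical permission vector."""
    return "OPEN" if permissions[PRIMITIVES.index(primitive)] else "DECLINE"

# Memorized gate: decisions stored per role name, only for roles seen in training.
MEMORIZED = {
    "archivist": {"OBEY": "DECLINE", "USE": "OPEN", "QUOTE": "OPEN"},  # hypothetical role
}

def memorized_gate(role: str, primitive: str) -> str:
    return MEMORIZED[role][primitive]  # raises KeyError for an unseen role

# Compositional binding: any role spec is first mapped to a permission vector,
# then the same shared gate is applied.
def bind(role_spec: Dict[str, bool]) -> PermissionVector:
    return tuple(bool(role_spec.get(p, False)) for p in PRIMITIVES)

unseen_role = {"QUOTE": True}  # hypothetical role never seen in training
print(gate(bind(unseen_role), "QUOTE"))  # works without memorizing the role
print(gate(bind(unseen_role), "OBEY"))
```

In this toy, the compositional path handles the unseen role because the role only needs to be mapped onto existing permission bits, while the memorized path has no entry to look up.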

Evidence Chain

  1. Geometry floor. Lazy-rudder / LRS1 measure a stable-rank floor for alignment updates. The floor is low and does not fall with width in the tested Qwen2.5 scale sweep.
  2. Gate installation. RPCG3 shows that the surface can implement a role-conditioned primitive gate, but only with a contrastive objective plus an anchor. Positive-only finetuning is not enough.
  3. Localization. RPCG5 shows the gate is low-rank within layers but depth-diffuse across layers.
  4. Behavior. RPCG8 shows the gate changes free generation, not just probe logits.
  5. Composition boundary. RPCG7, RPCG9, RPCG10, RPCG11, and RPCG11c all probe whether the model learns reusable permission parts. The failure modes differ, but the common boundary remains: fitting a local marginal is easier than composing unseen combinations.

Current State

The strongest positive claim is narrow but real: small, contrastively trained updates can install behavioral primitive gates.

The strongest negative claim is also narrow: the tested text-side recipes do not produce a robust role-to-permission binding layer. The program has varied objective, input format, primitive basis, nested training distribution, and primitive frequency. Those changes improved parts of the system, but none has yet solved composition.

RPCG11c is the latest rung. It kept RPCG11’s nested, prompt-injection-like setup and balanced the trained role lattice so that OBEY, USE, and QUOTE each received equal permission pressure. That removed the rare-QUOTE confound, but the result was still an informative VOID: coverage was rescued, but decline behavior collapsed.

Failure-mode progression

| Rung | Failure mode | Description |
| --- | --- | --- |
| RPCG7-10 | Per-spec memorization | Training roles fit, but held-out permission combinations do not compose. |
| RPCG11 | Rare-primitive collapse | Nested contexts work for common primitives; under-trained QUOTE fails coverage. |
| RPCG11c | Always-OPEN collapse | Primitive coverage is fixed, but forbidden-decline behavior falls below threshold. |

Latest Technical Result: RPCG11c

RPCG11c asked whether RPCG11 was mostly a data-balance problem. The answer was no, but in a useful way: primitive balancing fixed the under-trained primitive and exposed a different local optimum.

| Technical choice | RPCG11c setting | Outcome |
| --- | --- | --- |
| Primitive basis | OBEY / USE / QUOTE | Keeps RPCG10’s grounded provenance basis. |
| Role lattice | Five trained specs where each primitive is permitted in 3 of 5 specs | Per-primitive OPEN coverage is essentially full in every natural and structured cell, including held-out cells. |
| Prompt shape | Outer role policy plus a context-wrapper inner attempt | Keeps RPCG11’s nested-context distribution and latent candidate recovery. |
| Trap discipline | Deterministic min-overlap shuffled map | Trap C1 coverage fell to ~0.58, cleaner than RPCG11’s 0.639 knife-edge. |
| Sanity gates | convergence, low stable rank, quiet baseline, trap collapse | All green: the VOID is not methodological. |
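
To make the role-lattice constraint concrete, the sketch below enumerates every set of five distinct specs in which each primitive is permitted in exactly 3 of 5. It assumes that a spec is a binary permission vector over the three primitives; the actual RPCG11c lattice is not listed on this page, so the output shows candidates that satisfy the stated constraint, not the specs that were trained.

```python
from itertools import combinations, product

PRIMITIVES = ("OBEY", "USE", "QUOTE")
# All 8 possible binary permission vectors over the three primitives.
ALL_SPECS = list(product((0, 1), repeat=len(PRIMITIVES)))

def balanced_lattices(n_specs: int = 5, permits_per_primitive: int = 3):
    """Yield each n_specs-subset where every primitive is permitted equally often."""
    for subset in combinations(ALL_SPECS, n_specs):
        column_sums = [sum(spec[i] for spec in subset) for i in range(len(PRIMITIVES))]
        if all(c == permits_per_primitive for c in column_sums):
            yield subset

lattices = list(balanced_lattices())
trained = lattices[0]  # pick one candidate lattice for illustration
held_out = [spec for spec in ALL_SPECS if spec not in trained]

print("candidate lattices:", len(lattices))
print("trained specs (OBEY, USE, QUOTE):", trained)
print("held-out specs:", held_out)
```

Any such lattice leaves three permission vectors unseen, which is what makes a held-out composition test possible with only three primitives.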

The decisive failure is the forbidden side of the gate. RPCG11c learned to open nearly everything: forbidden-decline rates were only 0.000 to 0.167, far below the 0.667 preregistered threshold. In other words, frequency balancing rescued per-primitive coverage but produced an OPEN-class marginal collapse, not a role-to-permission map.
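
The pass/fail arithmetic is simple enough to sketch. Only the 0.667 threshold and the reported 0.000 to 0.167 range come from the text; the cell names and the choice of six forbidden probes per cell are illustrative assumptions.

```python
from typing import Dict, List

DECLINE_THRESHOLD = 0.667  # preregistered value quoted in the text

def forbidden_decline_rate(decisions: List[str]) -> float:
    """Fraction of forbidden-primitive probes that the model declined."""
    if not decisions:
        raise ValueError("need at least one forbidden probe")
    return sum(d == "DECLINE" for d in decisions) / len(decisions)

# Hypothetical cells with six forbidden probes each.
cells: Dict[str, List[str]] = {
    "natural/held-out": ["OPEN"] * 6,
    "structured/held-out": ["OPEN"] * 5 + ["DECLINE"],
}

for name, decisions in cells.items():
    rate = forbidden_decline_rate(decisions)
    verdict = "pass" if rate >= DECLINE_THRESHOLD else "fail"
    print(f"{name}: decline rate {rate:.3f} -> {verdict}")
```

A model that opens everything scores full OPEN coverage and a decline rate of zero at the same time, which is why coverage alone could not certify the gate.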

Five input-side levers have now been tested:

| Lever varied | Representative path | Resulting failure mode |
| --- | --- | --- |
| Objective | RPCG7 -> RPCG9 | Per-spec memorization. |
| Format | RPCG7 -> RPCG9 | Per-spec memorization survives explicit bit-vectors. |
| Primitive basis | RPCG9 -> RPCG10b | Grounded primitives fit better but do not compose. |
| Training distribution | RPCG10b -> RPCG11 | Nested contexts expose rare-primitive coverage collapse. |
| Primitive frequency | RPCG11 -> RPCG11c | Balanced coverage exposes OPEN-class marginal collapse. |

The remaining narrow input-side test is RPCG11d: class-weighted CE. It keeps the RPCG11c corpus and sampler unchanged, but weights OPEN and DECLINE losses inversely to their class frequency so the two-class loss no longer rewards the always-OPEN solution. If that still finds another local optimum instead of reusable binding, the next serious levers are architectural or out-of-band tensor policy inputs rather than more prompt formatting.
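
A minimal sketch of inverse-frequency class weighting, assuming a two-class OPEN/DECLINE target and the common weighting w_c = N / (K * n_c). The 60/40 label split, the near-always-OPEN predictor, and all function names are illustrative assumptions; the actual RPCG11d loss may be implemented differently.

```python
import numpy as np

OPEN, DECLINE = 0, 1

def inverse_frequency_weights(labels: np.ndarray, n_classes: int = 2) -> np.ndarray:
    """w_c = N / (K * n_c). Assumes every class appears at least once."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)

def weighted_ce(probs: np.ndarray, labels: np.ndarray, class_weights: np.ndarray) -> float:
    """Class-weighted cross-entropy, normalized by the total weight."""
    per_example = -np.log(probs[np.arange(len(labels)), labels])
    w = class_weights[labels]
    return float(np.sum(w * per_example) / np.sum(w))

# Illustrative 60/40 OPEN/DECLINE split (assumed, not measured).
labels = np.array([OPEN] * 6 + [DECLINE] * 4)
# A predictor that says OPEN with probability 0.95 regardless of input.
always_open = np.tile([0.95, 0.05], (len(labels), 1))

plain = weighted_ce(always_open, labels, np.ones(2))
reweighted = weighted_ce(always_open, labels, inverse_frequency_weights(labels))
print(f"unweighted CE: {plain:.3f}")
print(f"class-weighted CE: {reweighted:.3f}")
```

Under these weights each class contributes equally to the loss, so the always-OPEN predictor is penalized more heavily than under plain CE. Whether that is enough to push training toward a reusable binding rather than another local optimum is exactly what the experiment is meant to test.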

How To Read This Site

Papers on this track

This is one microsite for the whole track, not one per paper; it grows as the track adds papers (next: the canonical-policy-IR / binding-layer program).