The Low-Rank Alignment Control Surface

A technical overview of the geometry-to-function track: low-rank alignment updates, role gates, and the binding boundary.

This microsite tracks one research program: what small alignment updates are geometrically, and what behaviors they can actually control.

The motivating observation is that alignment / preference post-training does not appear to rewrite a model uniformly. In the studied runs, the useful update concentrates into a thin, low-rank surface on attention projections. The question is whether that surface is only a geometric curiosity, or whether it can be used as a reliable control mechanism for model behavior.

One-Screen Summary

| Layer | Question | Current answer |
| --- | --- | --- |
| Geometry | What does an alignment update look like? | A low, task-intrinsic stable-rank surface, flat across tested widths and replicated from GPT-NeoX-style to Qwen/Llama-style architectures. |
| Function | Can that surface implement a role-conditioned behavioral gate? | Yes: contrastive post-training can suppress forbidden primitive actions, and the suppression reaches free generation. |
| Boundary | Does the learned gate compose to unseen role/permission combinations? | No clean positive so far: the model tends to learn role-specific gates rather than reusable permission parts. |
| Open lever | What remains to test? | Frequency-balanced nested training next; after that, likely structured policy side inputs or explicit binding machinery. |

The title-level thesis is:

Contrastive post-training can install behavioral primitive role gates, but does not automatically induce a compositional binding layer from custom roles to canonical permission vectors.

Terms

Low-rank alignment control surface. The part of weight space that the alignment update actually uses. It is measured by stable rank, not by raw parameter count.
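
For readers who want a concrete handle on the measurement, here is a minimal sketch of stable rank, assuming the standard definition (squared Frobenius norm divided by squared spectral norm). The page does not state which variant the papers use, and the function name and the toy 64x64 update are illustrative assumptions, not the track's code.

```python
import numpy as np


def stable_rank(delta_w: np.ndarray) -> float:
    """Stable rank ||A||_F^2 / ||A||_2^2 of a 2-D weight update (assumed definition)."""
    singular_values = np.linalg.svd(delta_w, compute_uv=False)
    top = singular_values[0]  # SVD returns singular values in descending order
    if top == 0.0:
        return 0.0  # an all-zero update has no direction at all
    return float(np.sum(singular_values**2) / top**2)


rng = np.random.default_rng(0)
# Hypothetical update: a rank-2 signal plus small dense noise on a 64x64 projection.
low_rank_update = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))
noisy_update = low_rank_update + 1e-3 * rng.normal(size=(64, 64))

print(f"stable rank of toy update: {stable_rank(noisy_update):.3f}")
```

Unlike exact matrix rank, this quantity barely moves when small dense noise is added, which is one reason it suits measuring a trained update rather than counting parameters.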

Role-conditioned capability gate. A trained mechanism that says, for the active role, which primitive actions are open and which should be declined.

Compositional binding. The desired next step: a new role should work because the model maps it to a canonical permission vector, not because that exact role was memorized during training.
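
The difference between a memorized role gate and compositional binding can be made concrete with a purely symbolic sketch. This is not the trained mechanism, only an illustration of the two behaviors the definitions above distinguish. QUOTE is a primitive named on this page; the other primitive names, the role names, and all function names are hypothetical.

```python
# Hypothetical primitive basis; only QUOTE is taken from the page itself.
PRIMITIVES = ("READ", "WRITE", "QUOTE", "EXECUTE")


def bind(granted: set) -> tuple:
    """Map a role's granted primitives to a canonical permission vector."""
    return tuple(int(p in granted) for p in PRIMITIVES)


def gate(permission_vector: tuple, primitive: str) -> str:
    """Open the gate only if the primitive's bit is set in the vector."""
    return "allow" if permission_vector[PRIMITIVES.index(primitive)] else "decline"


# Memorized gating: behavior is keyed by the exact role name seen in training.
ROLE_GATES = {
    "auditor": bind({"READ", "QUOTE"}),
    "editor": bind({"READ", "WRITE"}),
}


def memorized_gate(role: str, primitive: str) -> str:
    """Works for seen roles only; an unseen role has no entry to look up."""
    if role not in ROLE_GATES:
        return "unknown role"
    return gate(ROLE_GATES[role], primitive)


# Compositional binding: a new role works because its spec maps to a vector.
new_role_vector = bind({"WRITE", "QUOTE"})
print(memorized_gate("archivist", "QUOTE"))  # unseen role name
print(gate(new_role_vector, "QUOTE"))        # same request via binding
```

In this framing, the track's boundary result says the trained model behaves more like `ROLE_GATES` than like `bind`.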

Evidence Chain

  1. Geometry floor. Lazy-rudder / LRS1 measure a stable-rank floor for alignment updates. The floor is low and does not fall with width in the tested Qwen2.5 scale sweep.
  2. Gate installation. RPCG3 shows that the surface can implement a role-conditioned primitive gate, but only with a contrastive objective plus an anchor. Positive-only finetuning is not enough.
  3. Localization. RPCG5 shows the gate is low-rank within layers but depth-diffuse across layers.
  4. Behavior. RPCG8 shows the gate changes free generation, not just probe logits.
  5. Composition boundary. RPCG7, RPCG9, RPCG10, and RPCG11 all probe whether the model learns reusable permission parts. The consistent failure mode is that fitting seen roles is easier than composing unseen combinations.
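
Step 2 says the gate needs a contrastive objective plus an anchor, but this page does not give the loss. The sketch below is one plausible shape under stated assumptions: a hinge term that prefers declining over complying on forbidden actions, plus a squared-drift penalty toward a frozen reference model. The function names, the margin, and the anchor weight are all hypothetical and should not be read as the RPCG3 recipe.

```python
import numpy as np


def contrastive_gate_loss(logp_decline, logp_comply, margin=1.0):
    """Hinge loss: on forbidden actions, declining should beat complying by `margin`."""
    gap = np.asarray(logp_decline) - np.asarray(logp_comply)
    return float(np.mean(np.maximum(0.0, margin - gap)))


def anchor_penalty(logp_model, logp_reference):
    """Mean squared drift from a frozen reference model on anchor prompts (assumed form)."""
    drift = np.asarray(logp_model) - np.asarray(logp_reference)
    return float(np.mean(drift**2))


def total_loss(logp_decline, logp_comply, logp_model, logp_reference,
               margin=1.0, anchor_weight=0.1):
    """Contrastive term plus weighted anchor term; both weights are illustrative."""
    return (contrastive_gate_loss(logp_decline, logp_comply, margin)
            + anchor_weight * anchor_penalty(logp_model, logp_reference))
```

A positive-only objective would correspond to dropping the `logp_comply` side entirely, which leaves nothing pushing the forbidden action down. That is consistent with the note that positive-only finetuning is not enough, though the actual reason in the papers may differ.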

Current State

The strongest positive claim is narrow but real: small, contrastively trained updates can install behavioral primitive gates.

The strongest negative claim is also narrow: the tested text-side recipes do not produce a robust role-to-permission binding layer. The program has varied the objective, the input format, the primitive basis, and the nesting of the training distribution. Those changes improved parts of the system, but none has yet solved composition.
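
To make "composition" testable, a held-out split has to guarantee that every primitive is seen both open and closed in training, so that a failure on an unseen combination cannot be blamed on an unseen part. The sketch below shows one way to build and check such a split. The three-primitive basis and the choice of held-out vectors are hypothetical and are not the splits used in RPCG7 through RPCG11.

```python
from itertools import product

PRIMITIVES = ("READ", "WRITE", "QUOTE")  # hypothetical basis; QUOTE appears on this page


def all_permission_vectors(n: int) -> list:
    """Every open/closed assignment over n primitives."""
    return list(product((0, 1), repeat=n))


def split_for_composition(vectors, held_out):
    """Train on all vectors except the held-out combinations."""
    train = [v for v in vectors if v not in held_out]
    test = [v for v in vectors if v in held_out]
    return train, test


def covers_every_primitive_state(train, n: int) -> bool:
    """True if each primitive is seen both open (1) and closed (0) in training."""
    return all({v[i] for v in train} == {0, 1} for i in range(n))


vectors = all_permission_vectors(len(PRIMITIVES))
held_out = {(1, 0, 1), (0, 1, 1)}
train, test = split_for_composition(vectors, held_out)
print(len(train), len(test), covers_every_primitive_state(train, len(PRIMITIVES)))
```

A model that fits `train` but fails on `test` under this check has learned the seen roles rather than reusable permission parts, which is the failure mode described above.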

RPCG11 is the latest rung. It changed the corpus shape from flat permission specs to nested prompt-injection-like contexts. The run was healthy, but the preregistered result was an informative VOID: the rare QUOTE primitive did not clear the in-distribution coverage threshold. That points to the next controlled test: a frequency-balanced nested corpus.
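
As an illustration of what "frequency-balanced" could mean at the corpus level, the sketch below resamples a skewed corpus so each primitive contributes the same number of examples. The corpus contents, the 90/10 skew, and the per-primitive count are made-up assumptions. The actual RPCG11 corpus and its preregistered coverage threshold are not specified on this page.

```python
import random
from collections import Counter, defaultdict


def primitive_counts(examples):
    """Count examples per primitive; each example is a (primitive, context) pair."""
    return Counter(primitive for primitive, _ in examples)


def balance_by_primitive(examples, per_primitive, seed=0):
    """Resample with replacement so every primitive contributes `per_primitive` examples."""
    rng = random.Random(seed)
    by_primitive = defaultdict(list)
    for example in examples:
        by_primitive[example[0]].append(example)
    balanced = []
    for primitive in sorted(by_primitive):
        balanced.extend(rng.choices(by_primitive[primitive], k=per_primitive))
    rng.shuffle(balanced)
    return balanced


# Hypothetical skewed corpus in which QUOTE is the rare primitive.
skewed = ([("READ", f"ctx-{i}") for i in range(90)]
          + [("QUOTE", f"ctx-{i}") for i in range(10)])
balanced = balance_by_primitive(skewed, per_primitive=50)
print(primitive_counts(skewed), primitive_counts(balanced))
```

Upsampling with replacement repeats rare examples rather than adding new ones, so a real balanced corpus would more likely generate fresh nested contexts for the rare primitive. The sketch only shows the target frequency profile.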

How To Read This Site

Papers on this track

This is one microsite for the whole track, not one per paper; it grows as the track adds papers (next: the canonical-policy-IR / binding-layer program).